Preface


This book is designed for a short course on machine learning. It is a short 
course, not a hurried course. From over a decade of teaching this material, we 
have distilled what we believe to be the core topics that every student of the 
subject should know. We chose the title ‘learning from data’ that faithfully 
describes what the subject is about, and made it a point to cover the topics in 
a story-like fashion. Our hope is that the reader can learn all the fundamentals 
of the subject by reading the book cover to cover. 

Learning from data has distinct theoretical and practical tracks. If you 
read two books that focus on one track or the other, you may feel that you 
are reading about two different subjects altogether. In this book, we balance 
the theoretical and the practical, the mathematical and the heuristic. Our 
criterion for inclusion is relevance. Theory that establishes the conceptual 
framework for learning is included, and so are heuristics that impact the per- 
formance of real learning systems. Strengths and weaknesses of the different 
parts are spelled out. Our philosophy is to say it like it is: what we know, 
what we don’t know, and what we partially know. 

The book can be taught in exactly the order it is presented. The notable 
exception may be Chapter 2, which is the most theoretical chapter of the book. 
The theory of generalization that this chapter covers is central to learning 
from data, and we made an effort to make it accessible to a wide readership. 
However, instructors who are more interested in the practical side may skim 
over it, or delay it until after the practical methods of Chapter 3 are taught. 

You will notice that we included exercises (in gray boxes) throughout the 
text. The main purpose of these exercises is to engage the reader and enhance 
understanding of a particular topic being covered. Our reason for separating 
the exercises out is that they are not crucial to the logical flow. Nevertheless, 
they contain useful information, and we strongly encourage you to read them, 
even if you don’t do them to completion. Instructors may find some of the 
exercises appropriate as ‘easy’ homework problems, and we also provide ad- 
ditional problems of varying difficulty in the Problems section at the end of 
each chapter. 

To help instructors with preparing their lectures based on the book, we 
provide supporting material on the book’s website (AMLbook.com). There is 
also a forum that covers additional topics in learning from data. We will 
discuss these further in the Epilogue of this book.


Chapter 1


The Learning Problem 


If you show a picture to a three-year-old and ask if there is a tree in it, you will 
likely get the correct answer. If you ask a thirty-year-old what the definition 
of a tree is, you will likely get an inconclusive answer. We didn’t learn what 
a tree is by studying the mathematical definition of trees. We learned it by 
looking at trees. In other words, we learned from ‘data’. 

Learning from data is used in situations where we don’t have an analytic 
solution, but we do have data that we can use to construct an empirical solu- 
tion. This premise covers a lot of territory, and indeed learning from data is 
one of the most widely used techniques in science, engineering, and economics, 
among other fields. 

In this chapter, we present examples of learning from data and formalize 
the learning problem. We also discuss the main concepts associated with 
learning, and the different paradigms of learning that have been developed. 


1.1 Problem Setup 


What do financial forecasting, medical diagnosis, computer vision, and search 
engines have in common? They all have successfully utilized learning from 
data. The repertoire of such applications is quite impressive. Let us open the 
discussion with a real-life application to see how learning from data works. 

Consider the problem of predicting how a movie viewer would rate the 
various movies out there. This is an important problem if you are a company 
that rents out movies, since you want to recommend to different viewers the 
movies they will like. Good recommender systems are so important to business 
that the movie rental company Netflix offered a prize of one million dollars to 
anyone who could improve their recommendations by a mere 10%. 

The main difficulty in this problem is that the criteria that viewers use to 
rate movies are quite complex. Trying to model those explicitly is no easy task, 
so it may not be possible to come up with an analytic solution. However, we
know that the historical rating data reveal a lot about how people rate movies,
so we may be able to construct a good empirical solution. There is a great
deal of data available to movie rental companies, since they often ask their
viewers to rate the movies that they have already seen.


Figure 1.1: A model for how a viewer rates a movie


Figure 1.1 illustrates a specific approach that was widely used in the 
million-dollar competition. Here is how it works. You describe a movie as 
a long array of different factors, e.g., how much comedy is in it, how com- 
plicated is the plot, how handsome is the lead actor, etc. Now, you describe 
each viewer with corresponding factors: how much do they like comedy, do
they prefer simple or complicated plots, how important are the looks of the 
lead actor, and so on. How this viewer will rate that movie is now estimated 
based on the match/mismatch of these factors. For example, if the movie is 
pure comedy and the viewer hates comedies, the chances are he won’t like it. 
If you take dozens of these factors describing many facets of a movie’s content 
and a viewer’s taste, the conclusion based on matching all the factors will be 
a good predictor of how the viewer will rate the movie. 


The power of learning from data is that this entire process can be auto- 
mated, without any need for analyzing movie content or viewer taste. To do 
so, the learning algorithm ‘reverse-engineers’ these factors based solely on
previous ratings. It starts with random factors, then tunes these factors to make
them more and more aligned with how viewers have rated movies before, until 
they are ultimately able to predict how viewers rate movies in general. The 
factors we end up with may not be as intuitive as ‘comedy content’, and in 
fact can be quite subtle or even incomprehensible. After all, the algorithm is 
only trying to find the best way to predict how a viewer would rate a movie, 
not necessarily explain to us how it is done. This algorithm was part of the 
winning solution in the million-dollar competition. 
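
The tuning process described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the competition: we assume that a rating is predicted by the inner product of a viewer's factors and a movie's factors, and that the factors are tuned by small gradient steps on the squared error. The function names, the toy ratings, the learning rate, and the number of passes are all hypothetical choices.

```python
import random

def train_factors(ratings, num_viewers, num_movies, num_factors=2,
                  learning_rate=0.02, epochs=2000, seed=0):
    """Tune viewer and movie factors so that their inner products match known ratings.

    ratings: list of (viewer_index, movie_index, rating) triples.
    """
    rng = random.Random(seed)
    # Start with small random factors, as described in the text.
    viewers = [[rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
               for _ in range(num_viewers)]
    movies = [[rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
              for _ in range(num_movies)]
    for _ in range(epochs):
        for v, m, r in ratings:
            predicted = sum(a * b for a, b in zip(viewers[v], movies[m]))
            error = r - predicted
            # Nudge both factor vectors to reduce the error on this rating.
            for k in range(num_factors):
                vk, mk = viewers[v][k], movies[m][k]
                viewers[v][k] += learning_rate * error * mk
                movies[m][k] += learning_rate * error * vk
    return viewers, movies

def predict(viewers, movies, v, m):
    """Predicted rating: match between viewer v's factors and movie m's factors."""
    return sum(a * b for a, b in zip(viewers[v], movies[m]))

# Hypothetical toy data: 3 viewers, 3 movies, ratings on a 1-5 scale.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
           (1, 2, 2.0), (2, 1, 5.0), (2, 2, 4.0)]
viewers, movies = train_factors(ratings, num_viewers=3, num_movies=3)
train_error = sum((r - predict(viewers, movies, v, m)) ** 2
                  for v, m, r in ratings) / len(ratings)
```

Note that nothing in the code says what the two factors mean. They are whatever values happen to reproduce the previous ratings, which is why the learned factors need not be as intuitive as ‘comedy content’.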


1.1.1 Components of Learning 


The movie rating application captures the essence of learning from data, and 
so do many other applications from vastly different fields. In order to abstract 
the common core of the learning problem, we will pick one application and 
use it as a metaphor for the different components of the problem. Let us take 
credit approval as our metaphor. 

Suppose that a bank receives thousands of credit card applications every 
day, and it wants to automate the process of evaluating them. Just as in the 
case of movie ratings, the bank knows of no magical formula that can pinpoint 
when credit should be approved, but it has a lot of data. This calls for learning 
from data, so the bank uses historical records of previous customers to figure 
out a good formula for credit approval. 

Each customer record has personal information related to credit, such as 
annual salary, years in residence, outstanding loans, etc. The record also keeps 
track of whether approving credit for that customer was a good idea, i.e., did 
the bank make money on that customer. This data guides the construction of 
a successful formula for credit approval that can be used on future applicants. 

Let us give names and symbols to the main components of this learning 
problem. There is the input $\mathbf{x}$ (customer information that is used to make
a credit decision), the unknown target function $f\colon \mathcal{X} \to \mathcal{Y}$ (ideal formula for
credit approval), where $\mathcal{X}$ is the input space (set of all possible inputs $\mathbf{x}$), and $\mathcal{Y}$
is the output space (set of all possible outputs, in this case just a yes/no
decision). There is a data set $\mathcal{D}$ of input-output examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$,
where $y_n = f(\mathbf{x}_n)$ for $n = 1, \dots, N$ (inputs corresponding to previous customers
and the correct credit decision for them in hindsight). The examples are often
referred to as data points. Finally, there is the learning algorithm that uses the
data set $\mathcal{D}$ to pick a formula $g\colon \mathcal{X} \to \mathcal{Y}$ that approximates $f$. The algorithm
chooses $g$ from a set of candidate formulas under consideration, which we call
the hypothesis set $\mathcal{H}$. For instance, $\mathcal{H}$ could be the set of all linear formulas
from which the algorithm would choose the best linear fit to the data, as we
will introduce later in this section.

When a new customer applies for credit, the bank will base its decision 
on g (the hypothesis that the learning algorithm produced), not on f (the 
ideal target function which remains unknown). The decision will be good only 
to the extent that g faithfully replicates f. To achieve that, the algorithm
chooses g that best matches f on the training examples of previous customers,
with the hope that it will continue to match f on new customers. Whether
or not this hope is justified remains to be seen. Figure 1.2 illustrates the
components of the learning problem.


Figure 1.2: Basic setup of the learning problem


Exercise 1.1 


Express each of the following tasks in the framework of learning from data by 
specifying the input space $\mathcal{X}$, output space $\mathcal{Y}$, target function $f\colon \mathcal{X} \to \mathcal{Y}$,
and the specifics of the data set that we will learn from. 
(a) Medical diagnosis: A patient walks in with a medical history and some 
symptoms, and you want to identify the problem. 


(b) Handwritten digit recognition (for example postal zip code recognition 
for mail sorting). 


(c) Determining if an email is spam or not. 


(d) Predicting how an electric load varies with price, temperature, and 
day of the week. 


(e) A problem of interest to you for which there is no analytic solution, 
but you have data from which to construct an empirical solution. 

We will use the setup in Figure 1.2 as our definition of the learning problem.
Later on, we will consider a number of refinements and variations to this basic 
setup as needed. However, the essence of the problem will remain the same. 
There is a target to be learned. It is unknown to us. We have a set of examples 
generated by the target. The learning algorithm uses these examples to look 
for a hypothesis that approximates the target. 


1.1.2 A Simple Learning Model 


Let us consider the different components of Figure 1.2. Given a specific learn- 
ing problem, the target function and training examples are dictated by the 
problem. However, the learning algorithm and hypothesis set are not. These 
are solution tools that we get to choose. The hypothesis set and learning 
algorithm are referred to informally as the learning model. 

Here is a simple model. Let $\mathcal{X} = \mathbb{R}^d$ be the input space, where $\mathbb{R}^d$ is the
d-dimensional Euclidean space, and let $\mathcal{Y} = \{+1, -1\}$ be the output space,
denoting a binary (yes/no) decision. In our credit example, different
coordinates of the input vector $\mathbf{x} \in \mathbb{R}^d$ correspond to salary, years in residence,
outstanding debt, and the other data fields in a credit application. The
binary output $y$ corresponds to approving or denying credit. We specify the
hypothesis set $\mathcal{H}$ through a functional form that all the hypotheses $h \in \mathcal{H}$
share. The functional form $h(\mathbf{x})$ that we choose here gives different weights to
the different coordinates of $\mathbf{x}$, reflecting their relative importance in the credit
decision. The weighted coordinates are then combined to form a ‘credit score’
and the result is compared to a threshold value. If the applicant passes the
threshold, credit is approved; if not, credit is denied:


$$\text{Approve credit if } \sum_{i=1}^{d} w_i x_i > \text{threshold},$$

$$\text{Deny credit if } \sum_{i=1}^{d} w_i x_i < \text{threshold}.$$


This formula can be written more compactly as 


$$h(\mathbf{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + b\right) \tag{1.1}$$


where $x_1, \dots, x_d$ are the components of the vector $\mathbf{x}$; $h(\mathbf{x}) = +1$ means
‘approve credit’ and $h(\mathbf{x}) = -1$ means ‘deny credit’; $\operatorname{sign}(s) = +1$ if $s > 0$ and
$\operatorname{sign}(s) = -1$ if $s < 0$.¹ The weights are $w_1, \dots, w_d$, and the threshold is
determined by the bias term $b$ since in Equation (1.1), credit is approved if
$\sum_{i=1}^{d} w_i x_i > -b$.

This model of $\mathcal{H}$ is called the perceptron, a name that it got in the context
of artificial intelligence. The learning algorithm will search $\mathcal{H}$ by looking for
weights and bias that perform well on the data set. Some of the weights
$w_1, \dots, w_d$ may end up being negative, corresponding to an adverse effect on
credit approval. For instance, the weight of the ‘outstanding debt’ field should
come out negative since more debt is not good for credit. The bias value $b$
may end up being large or small, reflecting how lenient or stringent the bank
should be in extending credit. The optimal choices of weights and bias define
the final hypothesis $g \in \mathcal{H}$ that the algorithm produces.


¹ The value of sign(s) when s = 0 is a simple technicality that we ignore for the moment.


Figure 1.3: Perceptron classification of linearly separable data in a
two-dimensional input space. Panels: (a) Misclassified data; (b) Perfectly
classified data. (a) Some training examples will be misclassified
(blue points in red region and vice versa) for certain values of the weight
parameters which define the separating line. (b) A final hypothesis that
classifies all training examples correctly. ($\circ$ is $+1$ and $\times$ is $-1$.)
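
To make Equation (1.1) concrete, here is a minimal sketch of the perceptron hypothesis applied to the credit example. The three input fields, their units, the weights, and the bias are all hypothetical values chosen for illustration; in practice they would come out of the learning algorithm. The value returned for a score of exactly zero is an arbitrary choice of this sketch.

```python
def sign(s):
    # The case s == 0 is a technicality; this sketch arbitrarily returns -1 there.
    return 1 if s > 0 else -1

def perceptron(weights, bias, x):
    """Evaluate h(x) = sign((sum_i w_i * x_i) + b) as in Equation (1.1)."""
    credit_score = sum(w * xi for w, xi in zip(weights, x))
    return sign(credit_score + bias)

# Hypothetical input fields: salary and outstanding debt in tens of thousands
# of dollars, and years in residence. The debt weight is negative.
weights = [0.5, 0.2, -0.8]   # salary, years in residence, outstanding debt
bias = -2.0                  # same as requiring a credit score above 2.0

applicant_a = [6.0, 3.0, 1.0]   # score 3.0 + 0.6 - 0.8 = 2.8, above the threshold
applicant_b = [3.0, 1.0, 2.0]   # score 1.5 + 0.2 - 1.6 = 0.1, below the threshold

decision_a = perceptron(weights, bias, applicant_a)   # +1 means approve credit
decision_b = perceptron(weights, bias, applicant_b)   # -1 means deny credit
```

Making the bias less negative in this sketch corresponds to a more lenient bank, since a lower credit score would then be enough for approval.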


Exercise 1.2 
Suppose that we use a perceptron to detect spam messages. Let's say
that each email message is represented by the frequency of occurrence of
keywords, and the output is +1 if the message is considered spam.


(a) Can you think of some keywords that will end up with a large positive 
weight in the perceptron? 


(b) How about keywords that will get a negative weight? 


(c) What parameter in the perceptron directly affects how many border- 
line messages end up being classified as spam? 


Figure 1.3 illustrates what a perceptron does in a two-dimensional case ($d = 2$).
The plane is split by a line into two regions, the $+1$ decision region and the $-1$
decision region. Different values for the parameters $w_1, w_2, b$ correspond to
different lines $w_1 x_1 + w_2 x_2 + b = 0$. If the data set is linearly separable, there
will be a choice for these parameters that classifies all the training examples 
correctly. 

To simplify the notation of the perceptron formula, we will treat the bias $b$
as a weight $w_0 = b$ and merge it with the other weights into one vector
$\mathbf{w} = [w_0, w_1, \dots, w_d]^{\mathrm{T}}$, where T denotes the transpose of a vector, so $\mathbf{w}$ is a
column vector. We also treat $\mathbf{x}$ as a column vector and modify it to become
$\mathbf{x} = [x_0, x_1, \dots, x_d]^{\mathrm{T}}$, where the added coordinate $x_0$ is fixed at $x_0 = 1$. Formally
speaking, the input space is now

$$\mathcal{X} = \{1\} \times \mathbb{R}^d = \{[x_0, x_1, \dots, x_d]^{\mathrm{T}} \mid x_0 = 1,\ x_1 \in \mathbb{R}, \dots, x_d \in \mathbb{R}\}.$$

With this convention, $\mathbf{w}^{\mathrm{T}}\mathbf{x} = \sum_{i=0}^{d} w_i x_i$, and so Equation (1.1) can be
rewritten in vector form as

$$h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\mathrm{T}}\mathbf{x}). \tag{1.2}$$
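
The following sketch checks, on a few hypothetical inputs, that the vector form (1.2) gives the same decisions as the original form (1.1) once the bias is absorbed as $w_0$ and the coordinate $x_0 = 1$ is added. The helper names and the numbers are ours, chosen only for illustration.

```python
def sign(s):
    # The case s == 0 is a technicality; this sketch arbitrarily returns -1 there.
    return 1 if s > 0 else -1

def h_with_bias(weights, bias, x):
    """Equation (1.1): explicit weights w_1..w_d and a separate bias b."""
    return sign(sum(w * xi for w, xi in zip(weights, x)) + bias)

def augment(x):
    """Prepend the fixed coordinate x_0 = 1 to the input vector."""
    return [1.0] + list(x)

def h_vector_form(w, x_augmented):
    """Equation (1.2): sign of the inner product, with w_0 playing the role of b."""
    return sign(sum(wi * xi for wi, xi in zip(w, x_augmented)))

# Hypothetical weights and inputs.
weights, bias = [0.5, 0.2, -0.8], -2.0
w = [bias] + weights                      # w = [w_0, w_1, ..., w_d] with w_0 = b
inputs = [[6.0, 3.0, 1.0], [3.0, 1.0, 2.0], [0.0, 0.0, 0.0]]

agree = all(h_with_bias(weights, bias, x) == h_vector_form(w, augment(x))
            for x in inputs)
```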


We now introduce the perceptron learning algorithm (PLA). The algorithm
will determine what $\mathbf{w}$ should be, based on the data. Let us assume that the
data set is linearly separable, which means that there is a vector $\mathbf{w}$ that
makes (1.2) achieve the correct decision $h(\mathbf{x}_n) = y_n$ on all the training
examples, as shown in Figure 1.3.


Our learning algorithm will find this $\mathbf{w}$ using a simple iterative method.
Here is how it works. At iteration $t$, where $t = 0, 1, 2, \dots$, there is a current
value of the weight vector, call it $\mathbf{w}(t)$. The algorithm picks an example
from $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ that is currently misclassified, call it $(\mathbf{x}(t), y(t))$, and
uses it to update $\mathbf{w}(t)$. Since the example is misclassified, we have
$y(t) \neq \operatorname{sign}(\mathbf{w}^{\mathrm{T}}(t)\mathbf{x}(t))$. The update rule is

$$\mathbf{w}(t+1) = \mathbf{w}(t) + y(t)\mathbf{x}(t). \tag{1.3}$$

This rule moves the boundary in the direction of classifying $\mathbf{x}(t)$ correctly, as
depicted in the figure above. The algorithm continues with further iterations
until there are no longer misclassified examples in the data set.
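
The iteration just described can be sketched as follows. The inputs are assumed to already include the coordinate $x_0 = 1$. The cap on the number of iterations is our own safeguard for data that might not be separable, and the four data points are hypothetical.

```python
import random

def sign(s):
    # The case s == 0 is a technicality; this sketch arbitrarily returns -1 there.
    return 1 if s > 0 else -1

def pla(data, max_iterations=10_000, seed=0):
    """Perceptron learning algorithm.

    data: list of (x, y) pairs, where x includes x_0 = 1 and y is +1 or -1.
    Returns the final weight vector and the number of updates performed.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])            # initialize w(0) to the zero vector
    for t in range(max_iterations):
        misclassified = [(x, y) for x, y in data
                         if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not misclassified:
            return w, t                    # every example is classified correctly
        x, y = rng.choice(misclassified)   # pick a misclassified example at random
        w = [wi + y * xi for wi, xi in zip(w, x)]   # update rule (1.3)
    raise RuntimeError("no separating weight vector found within max_iterations")

# Hypothetical linearly separable data in two dimensions (x_0 = 1 is prepended).
data = [([1.0, 2.0, 3.0], +1), ([1.0, 3.0, 3.0], +1),
        ([1.0, 1.0, 0.5], -1), ([1.0, 0.0, 1.0], -1)]
w, iterations = pla(data)
```

Choosing the misclassified example at random follows one of the two options mentioned in the text; cycling through the examples and taking the first misclassified one would work equally well in this sketch.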

Exercise 1.3


The weight update rule in (1.3) has the nice interpretation that it moves 
in the direction of classifying x(t) correctly. 


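(a) Show that $y(t)\mathbf{w}^{\mathrm{T}}(t)\mathbf{x}(t) < 0$. [Hint: $\mathbf{x}(t)$ is misclassified by $\mathbf{w}(t)$.]


(b) Show that $y(t)\mathbf{w}^{\mathrm{T}}(t+1)\mathbf{x}(t) > y(t)\mathbf{w}^{\mathrm{T}}(t)\mathbf{x}(t)$. [Hint: Use (1.3).]


(c) As far as classifying $\mathbf{x}(t)$ is concerned, argue that the move from $\mathbf{w}(t)$
to $\mathbf{w}(t+1)$ is a move ‘in the right direction’.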


Although the update rule in (1.3) considers only one training example at a 
time and may ‘mess up’ the classification of the other examples that are not 
involved in the current iteration, it turns out that the algorithm is guaranteed 
to arrive at the right solution in the end. The proof is the subject of Prob- 
lem 1.3. The result holds regardless of which example we choose from among 
the misclassified examples in $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ at each iteration, and
regardless of how we initialize the weight vector to start the algorithm. For
simplicity, we can pick one of the misclassified examples at random (or cycle 
through the examples and always choose the first misclassified one), and we 
can initialize w(0) to the zero vector. 

Within the infinite space of all weight vectors, the perceptron algorithm 
manages to find a weight vector that works, using a simple iterative process. 
This illustrates how a learning algorithm can effectively search an infinite 
hypothesis set using a finite number of simple steps. This feature is character- 
istic of many techniques that are used in learning, some of which are far more 
sophisticated than the perceptron learning algorithm. 


Exercise 1.4 


Let us create our own target function f and data set D and see how the 
perceptron learning algorithm works. Take d = 2 so you can visualize the 
problem, and choose a random line in the plane as your target function, 
where one side of the line maps to $+1$ and the other maps to $-1$. Choose
the inputs $\mathbf{x}_n$ of the data set as random points in the plane, and evaluate
the target function on each $\mathbf{x}_n$ to get the corresponding output $y_n$.


Now, generate a data set of size 20. Try the perceptron learning algorithm 
on your data set and see how long it takes to converge and how well the 
final hypothesis g matches your target f. You can find other ways to play 
with this experiment in Problem 1.4. 
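
As a starting point for this experiment, here is one possible way to generate the target and the data set. The exercise leaves the details open, so the choices below are assumptions of ours: the points are drawn uniformly from the square $[-1, 1] \times [-1, 1]$, and the random line is the one passing through two such random points. Each input is stored with the added coordinate $x_0 = 1$ so that it can be passed to a perceptron in the vector form (1.2).

```python
import random

def make_target(rng):
    """Return weights [w0, w1, w2] of a random line w0 + w1*x1 + w2*x2 = 0."""
    (a1, a2), (b1, b2) = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(2)]
    # Line through points a and b: (b2 - a2)*(x1 - a1) - (b1 - a1)*(x2 - a2) = 0
    w1 = b2 - a2
    w2 = -(b1 - a1)
    w0 = -(w1 * a1 + w2 * a2)
    return [w0, w1, w2]

def f(target, x):
    """Target function: +1 on one side of the line, -1 on the other."""
    s = sum(wi * xi for wi, xi in zip(target, x))
    return 1 if s > 0 else -1

def make_data_set(target, n, rng):
    """Generate n random inputs (with x_0 = 1) and label them with the target."""
    data = []
    for _ in range(n):
        x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, f(target, x)))
    return data

rng = random.Random(1)      # fixed seed so that the experiment is repeatable
target = make_target(rng)
data = make_data_set(target, 20, rng)
```

Because the labels come from a line, the resulting data set is linearly separable by construction, which is the situation in which the perceptron learning algorithm is guaranteed to converge.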


The perceptron learning algorithm succeeds in achieving its goal: finding a
hypothesis that classifies all the points in the data set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$
correctly. Does this mean that this hypothesis will also be successful in classi- 
fying new data points that are not in D? This turns out to be the key question 
in the theory of learning, a question that will be thoroughly examined in this 
book. 

Figure 1.4: The learning approach to coin classification (a) Training data of
pennies, nickels, dimes, and quarters (1, 5, 10, and 25 cents) are represented 
in a size-mass space where they fall into clusters. (b) A classification rule is 
learned from the data set by separating the four clusters. A new coin will 
be classified according to the region in the size-mass plane that it falls into. 


1.1.3 Learning versus Design


So far, we have discussed what learning is. Now, we discuss what it is not. The 
goal is to distinguish between learning and a related approach that is used for 
similar problems. While learning is based on data, this other approach does 
not use data. It is a ‘design’ approach based on specifications, and is often 
discussed alongside the learning approach in pattern recognition literature. 

Consider the problem of recognizing coins of different denominations, which 
is relevant to vending machines, for example. We want the machine to recog- 
nize quarters, dimes, nickels and pennies. We will contrast the ‘learning from 
data’ approach and the ‘design from specifications’ approach for this prob- 
lem. We assume that each coin will be represented by its size and mass, a 
two-dimensional input. 

In the learning approach, we are given a sample of coins from each of 
the four denominations and we use these coins as our data set. We treat 
the size and mass as the input vector, and the denomination as the output. 
Figure 1.4(a) shows what the data set may look like in the input space. There 
is some variation of size and mass within each class, but by and large coins 
of the same denomination cluster together. The learning algorithm searches 
for a hypothesis that classifies the data set well. If we want to classify a new 
coin, the machine measures its size and mass, and then classifies it according 
to the learned hypothesis in Figure 1.4(b). 

Figure 1.5: The design approach to coin classification (a) A probabilistic
model for the size, mass, and denomination of coins is derived from known
specifications. The figure shows the high probability region for each
denomination (1, 5, 10, and 25 cents) according to the model. (b) A classification
rule is derived analytically to minimize the probability of error in classifying
a coin based on size and mass. The resulting regions for each denomination
are shown.


In the design approach, we call the United States Mint and ask them about
the specifications of different coins. We also ask them about the number
of coins of each denomination in circulation, in order to get an estimate of
the relative frequency of each coin. Finally, we make a physical model of
the variations in size and mass due to exposure to the elements and due to
errors in measurement. We put all of this information together and compute
the full joint probability distribution of size, mass, and coin denomination
(Figure 1.5(a)). Once we have that joint distribution, we can construct the
optimal decision rule to classify coins based on size and mass (Figure 1.5(b)).
The rule chooses the denomination that has the highest probability for a given
size and mass, thus achieving the smallest possible probability of error.²
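
To illustrate the design approach, here is a minimal sketch of such a decision rule. Everything numerical in it is hypothetical: the nominal sizes and masses, the relative frequencies, and the spreads are illustrative values rather than actual Mint specifications. We also assume, for simplicity, that the variations in size and mass are independent and Gaussian with the same spread for every denomination. Notice that no data set appears anywhere; the rule is derived from the specifications alone.

```python
import math

# Hypothetical specifications: nominal size (mm), nominal mass (g), and
# relative frequency in circulation for each denomination (in cents).
SPECS = {
    1:  {"size": 19.0, "mass": 2.5, "frequency": 0.50},
    5:  {"size": 21.0, "mass": 5.0, "frequency": 0.15},
    10: {"size": 18.0, "mass": 2.3, "frequency": 0.20},
    25: {"size": 24.0, "mass": 5.7, "frequency": 0.15},
}
SIZE_SPREAD = 0.2   # assumed standard deviation of size (wear, measurement error)
MASS_SPREAD = 0.1   # assumed standard deviation of mass

def log_joint(denomination, size, mass):
    """Log of P(denomination) * p(size, mass | denomination), up to a constant.

    The Gaussian normalization is the same for all denominations here,
    so it is dropped without affecting which denomination scores highest.
    """
    spec = SPECS[denomination]
    z_size = (size - spec["size"]) / SIZE_SPREAD
    z_mass = (mass - spec["mass"]) / MASS_SPREAD
    return math.log(spec["frequency"]) - 0.5 * (z_size ** 2 + z_mass ** 2)

def classify(size, mass):
    """Choose the denomination with the highest probability given size and mass."""
    return max(SPECS, key=lambda d: log_joint(d, size, mass))

coin = classify(19.1, 2.45)   # a coin measured close to the nominal 1-cent values
```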


The main difference between the learning approach and the design ap- 
proach is the role that data plays. In the design approach, the problem is well 
specified and one can analytically derive f without the need to see any data. 
In the learning approach, the problem is much less specified, and one needs 
data to pin down what f is. 

Both approaches may be viable in some applications, but only the learning 
approach is possible in many applications where the target function is un- 
known. We are not trying to compare the utility or the performance of the 
two approaches. We are just making the point that the design approach is 
distinct from learning. This book is about learning. 


² This is called Bayes optimal decision theory. Some learning models are based on the
same theory by estimating the probability from data.

Exercise 1.5


Which of the following problems are more suited for the learning approach 
and which are more suited for the design approach? 


(a) Determining the age at which a particular medical test should be 
performed 


(b) Classifying numbers into primes and non-primes 
(c) Detecting potential fraud in credit card charges 
(d) Determining the time it would take a falling object to hit the ground 


(e) Determining the optimal cycle for traffic lights in a busy intersection 


1.2 Types of Learning 


The basic premise of learning from data is the use of a set of observations to 
uncover an underlying process. It is a very broad premise, and difficult to fit 
into a single framework. As a result, different learning paradigms have arisen 
to deal with different situations and different assumptions. In this section, we 
introduce some of these paradigms. 

The learning paradigm that we have discussed so far is called supervised 
learning. It is the most studied and most utilized type of learning, but it is 
not the only one. Some variations of supervised learning are simple enough 
to be accommodated within the same framework. Other variations are more 
profound and lead to new concepts and techniques that take on lives of their 
own. The most important variations have to do with the nature of the data 
set. 


1.2.1 Supervised Learning 


When the training data contains explicit examples of what the correct output 
should be for given inputs, then we are within the supervised learning set- 
ting that we have covered so far. Consider the hand-written digit recognition 
problem (task (b) of Exercise 1.1). A reasonable data set for this problem is 
a collection of images of hand-written digits, and for each image, what the 
digit actually is. We thus have a set of examples of the form ( image , digit ). 
The learning is supervised in the sense that some ‘supervisor’ has taken the 
trouble to look at each input, in this case an image, and determine the correct 
output, in this case one of the ten categories {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

While we are on the subject of variations, there is more than one way that 
a data set can be presented to the learning process. Data sets are typically cre- 
ated and presented to us in their entirety at the outset of the learning process. 
For instance, historical records of customers in the credit-card application, 
and previous movie ratings of customers in the movie rating application, are 
already there for us to use. This protocol of a ‘ready’ data set is the most 
common in practice, and it is what we will focus on in this book. However, it
is worth noting that two variations of this protocol have attracted a significant 
body of work. 

One is active learning, where the data set is acquired through queries that 
we make. Thus, we get to choose a point x in the input space, and the 
supervisor reports to us the target value for x. As you can see, this opens 
the possibility for strategic choice of the point x to maximize its information 
value, similar to asking a strategic question in a game of 20 questions. 
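
The value of choosing the queries strategically can be seen in a toy setting. In the sketch below, which is our own illustration rather than a general active learning method, the input is a single number in $[0, 1]$ and the target is $+1$ at or above some unknown threshold and $-1$ below it. Each query is placed at the middle of the interval that is still undecided, so every answer halves the uncertainty. The hidden threshold value and the function names are hypothetical.

```python
def active_learn_threshold(query, low=0.0, high=1.0, num_queries=20):
    """Locate an unknown threshold in [low, high] using chosen queries.

    query(x) plays the role of the supervisor: it returns +1 if x is at or
    above the threshold, and -1 otherwise.
    """
    for _ in range(num_queries):
        mid = (low + high) / 2     # the most informative point to ask about
        if query(mid) == 1:
            high = mid             # the threshold is at or below mid
        else:
            low = mid              # the threshold is above mid
    return (low + high) / 2

hidden_threshold = 0.3721          # known only to the supervisor
supervisor = lambda x: 1 if x >= hidden_threshold else -1
estimate = active_learn_threshold(supervisor)
```

With 20 queries, the undecided interval shrinks by a factor of $2^{20}$, so the estimate is within about one millionth of the threshold. Pinning down the threshold to the same accuracy with inputs that we do not get to choose would typically require far more examples.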

Another variation is called online learning, where the data set is given to 
the algorithm one example at a time. This happens when we have stream- 
ing data that the algorithm has to process ‘on the run’. For instance, when 
the movie recommendation system discussed in Section 1.1 is deployed, on- 
line learning can process new ratings from current users and movies. Online 
learning is also useful when we have limitations on computing and storage 
that preclude us from processing the whole data as a batch. We should note 
that online learning can be used in different paradigms of learning, not just in 
supervised learning. 


1.2.2 Reinforcement Learning 


When the training data does not explicitly contain the correct output for each 
input, we are no longer in a supervised learning setting. Consider a toddler 
learning not to touch a hot cup of tea. The experience of such a toddler 
would typically comprise a set of occasions when the toddler confronted a hot 
cup of tea and was faced with the decision of touching it or not touching it. 
Presumably, every time she touched it, the result was a high level of pain, and 
every time she didn’t touch it, a much lower level of pain resulted (that of an 
unsatisfied curiosity). Eventually, the toddler learns that she is better off not 
touching the hot cup. 

The training examples did not spell out what the toddler should have done, 
but they instead graded different actions that she has taken. Nevertheless, she 
uses the examples to reinforce the better actions, eventually learning what she 
should do in similar situations. This characterizes reinforcement learning, 
where the training example does not contain the target output, but instead 
contains some possible output together with a measure of how good that out- 
put is. In contrast to supervised learning where the training examples were of 
the form ( input , correct output ), the examples in reinforcement learning are 
of the form 

( input , some output , grade for this output ). 


Importantly, the example does not say how good other outputs would have 
been for this particular input. 
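
The form of these examples can be illustrated with a deliberately naive sketch. It is not a full reinforcement learning algorithm: it only averages the grades observed for each output tried on a given input, and then prefers the output with the best average. The grades below are hypothetical numbers standing in for the toddler's experience.

```python
from collections import defaultdict

def learn_policy(examples):
    """Pick, for each input, the output with the highest average grade.

    examples: iterable of (input, some output, grade for this output).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for situation, action, grade in examples:
        totals[(situation, action)] += grade
        counts[(situation, action)] += 1
    best = {}
    for (situation, action), total in totals.items():
        average = total / counts[(situation, action)]
        if situation not in best or average > best[situation][1]:
            best[situation] = (action, average)
    return {situation: action for situation, (action, _) in best.items()}

# Hypothetical graded experiences (a higher grade means less pain).
examples = [
    ("hot cup", "touch", -10.0),
    ("hot cup", "don't touch", -1.0),
    ("hot cup", "touch", -9.0),
    ("hot cup", "don't touch", -2.0),
]
policy = learn_policy(examples)
```

No example in the list states the correct output for "hot cup". The preferred action emerges only from comparing the grades of the actions that were actually tried.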

Reinforcement learning is especially useful for learning how to play a game. 
Imagine a situation in backgammon where you have a choice between different 
actions and you want to identify the best action. It is not a trivial task to 
ascertain what the best action is at a given stage of the game, so we cannot
easily create supervised learning examples. If you use reinforcement learning
instead, all you need to do is to take some action and report how well things
went, and you have a training example. The reinforcement learning algorithm
is left with the task of sorting out the information coming from different
examples to find the best line of play.


Figure 1.6: Unsupervised learning of coin classification (a) The same data
set of coins in Figure 1.4(a) is again represented in the size-mass space, but
without being labeled. They still fall into clusters. (b) An unsupervised
classification rule treats the four clusters as different types. The rule may
be somewhat ambiguous, as type 1 and type 2 could be viewed as one cluster.


1.2.3 Unsupervised Learning 


In the unsupervised setting, the training data does not contain any output 
information at all. We are just given input examples $\mathbf{x}_1, \dots, \mathbf{x}_N$. You may
wonder how we could possibly learn anything from mere inputs. Consider the 
coin classification problem that we discussed earlier in Figure 1.4. Suppose 
that we didn’t know the denomination of any of the coins in the data set. This 
unlabeled data is shown in Figure 1.6(a). We still get similar clusters, but they 
are now unlabeled so all points have the same ‘color’. The decision regions
in unsupervised learning may be identical to those in supervised learning, but
without the labels (Figure 1.6(b)). However, the correct clustering is less
obvious now, and even the number of clusters may be ambiguous.

Nonetheless, this example shows that we can learn something from the
inputs by themselves. Unsupervised learning can be viewed as the task of 
spontaneously finding patterns and structure in input data. For instance, if 
our task is to categorize a set of books into topics, and we only use general 
properties of the various books, we can identify books that have similar prop- 
erties and put them together in one category, without naming that category. 
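
One common way to group unlabeled points, which the text does not name, is the k-means method: guess some cluster centers, assign each point to its nearest center, move each center to the mean of its points, and repeat. The sketch below applies it to hypothetical unlabeled (size, mass) measurements. The number of clusters is given to it as an assumption, and the deterministic farthest-point initialization is a choice of ours to keep the example repeatable.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))

def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (centers, cluster index per point)."""
    # Farthest-point initialization: start from the first point, then keep
    # adding the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:   # move the center to the mean of its members
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, [nearest(p, centers) for p in points]

# Hypothetical unlabeled coin measurements (size in mm, mass in g).
points = [
    (19.0, 2.5), (19.1, 2.4), (18.9, 2.6),
    (21.2, 5.0), (21.3, 5.1), (21.1, 4.9),
    (17.9, 2.3), (18.0, 2.2), (17.8, 2.3),
    (24.2, 5.7), (24.3, 5.6), (24.1, 5.8),
]
centers, labels = k_means(points, k=4)
```

The output assigns each coin a cluster index, not a denomination. As in the book categorization example, the categories are found without being named, and asking for a different number of clusters would give a different but equally unlabeled grouping.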

Unsupervised learning can also be viewed as a way to create a higher-
level representation of the data. Imagine that you don’t speak a word of 
Spanish, but your company will relocate you to Spain next month. They 
will arrange for Spanish lessons once you are there, but you would like to 
prepare yourself a bit before you go. All you have access to is a Spanish radio 
station. For a full month, you continuously bombard yourself with Spanish; 
this is an unsupervised learning experience since you don’t know the meaning 
of the words. However, you gradually develop a better representation of the 
language in your brain by becoming more tuned to its common sounds and 
structures. When you arrive in Spain, you will be in a better position to start 
your Spanish lessons. Indeed, unsupervised learning can be a precursor to 
supervised learning. In other cases, it is a stand-alone technique. 
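
The clustering idea in this section can be made concrete with a small sketch. The code below uses k-means, a standard clustering method that the text does not name, on made-up (size, mass) measurements; the data, the function name `kmeans`, and the farthest-point initialisation are all assumptions chosen for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; returns one cluster index per point."""
    # Farthest-point initialisation: start from the first point, then keep
    # adding the point that is farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Hypothetical unlabeled "coins": (size, mass) pairs with no denominations.
coins = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
         (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
         (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
labels = kmeans(coins, k=3)
print(labels)
```

The output groups the points without naming the groups, and the number of clusters k had to be supplied by hand, which mirrors the ambiguity noted above.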


Exercise 1.6 

For each of the following tasks, identify which type of learning is involved 
(supervised, reinforcement, or unsupervised) and the training data to be 
used. If a task can fit more than one type, explain how and describe the 
training data for each type. 


(a) Recommending a book to a user in an online bookstore 
(b) Playing tic-tac-toe 

(c) Categorizing movies into different types 

(d) Learning to play music 


(e) Credit limit: Deciding the maximum allowed debt for each bank cus- 
tomer 


Our main focus in this book will be supervised learning, which is the most 
popular form of learning from data. 


1.2.4 Other Views of Learning 


The study of learning has evolved somewhat independently in a number of 
fields that started historically at different times and in different domains, and 
these fields have developed different emphases and even different jargons. As a 
result, learning from data is a diverse subject with many aliases in the scientific 
literature. The main field dedicated to the subject is called machine learning, 
a name that distinguishes it from human learning. We briefly mention two 
other important fields that approach learning from data in their own ways. 
Statistics shares the basic premise of learning from data, namely the use 
of a set of observations to uncover an underlying process. In this case, the 
process is a probability distribution and the observations are samples from that 
distribution. Because statistics is a mathematical field, emphasis is given to 
situations where most of the questions can be answered with rigorous proofs. 
As a result, statistics focuses on somewhat idealized models and analyzes them 
in great detail. This is the main difference between the statistical approach
to learning and how we approach the subject here. We make less restrictive
assumptions and deal with more general models than in statistics. Therefore,
we end up with weaker results that are nonetheless broadly applicable.

Figure 1.7: A visual learning problem. The first two rows show the training
examples (each input x is a 9-bit vector represented visually as a 3 × 3
black-and-white array). The inputs in the first row have f(x) = -1, and the inputs
in the second row have f(x) = +1. Your task is to learn from this data set
what f is, then apply f to the test input at the bottom. Do you get -1
or +1?

Data mining is a practical field that focuses on finding patterns, correla- 
tions, or anomalies in large relational databases. For example, we could be 
looking at medical records of patients and trying to detect a cause-effect re- 
lationship between a particular drug and long-term effects. We could also be 
looking at credit card spending patterns and trying to detect potential fraud. 
Technically, data mining is the same as learning from data, with more empha- 
sis on data analysis than on prediction. Because databases are usually huge, 
computational issues are often critical in data mining. Recommender systems, 
which were illustrated in Section 1.1 with the movie rating example, are also 
considered part of data mining. 


1.3 Is Learning Feasible? 


The target function f is the object of learning. The most important assertion
about the target function is that it is unknown. We really mean unknown.
This raises a natural question. How could a limited data set reveal enough
information to pin down the entire target function? Figure 1.7 illustrates this
difficulty. A simple learning task with 6 training examples of a ±1 target
function is shown. Try to learn what the function is, then apply it to the test
input given. Do you get -1 or +1? Now, show the problem to your friends
and see if they get the same answer.

The chances are the answers were not unanimous, and for good reason. 
There is simply more than one function that fits the 6 training examples, and 
some of these functions have a value of -1 on the test point and others have a
value of +1. For instance, if the true f is +1 when the pattern is symmetric,
the value for the test point would be +1. If the true f is +1 when the top left
square of the pattern is white, the value for the test point would be -1. Both
functions agree with all the examples in the data set, so there isn’t enough 
information to tell us which would be the correct answer. 

This does not bode well for the feasibility of learning. To make matters 
worse, we will now see that the difficulty we experienced in this simple problem 
is the rule, not the exception. 


1.3.1 Outside the Data Set 


When we get the training data D, e.g., the first two rows of Figure 1.7, we 
know the value of f on all the points in D. This doesn’t mean that we have 
learned f, since it doesn’t guarantee that we know anything about f outside 
of D. We know what we have already seen, but that’s not learning. That’s 
memorizing. 

Does the data set D tell us anything outside of D that we didn’t know 
before? If the answer is yes, then we have learned something. If the answer is 
no, we can conclude that learning is not feasible. 

Since we maintain that f is an unknown function, we can prove that f 
remains unknown outside of D. Instead of going through a formal proof for 
the general case, we will illustrate the idea in a concrete case. Consider a 
Boolean target function over a three-dimensional input space X = {0,1}^3.
We are given a data set D of five examples represented in the table below. We
denote the binary output by ○/● for visual clarity,





where y_n = f(x_n) for n = 1,2,3,4,5. The advantage of this simple Boolean
case is that we can enumerate the entire input space (since there are only 2^3 = 8
distinct input vectors), and we can enumerate the set of all possible target
functions (since f is a Boolean function on 3 Boolean inputs, and there are
only 2^(2^3) = 256 distinct Boolean functions on 3 Boolean inputs).

Let us look at the problem of learning f. Since f is unknown except
inside D, any function that agrees with D could conceivably be f. The table
below shows all such functions f_1, ..., f_8. It also shows the data set D (in
blue) and what the final hypothesis g may look like.

The final hypothesis g is chosen based on the five examples in D. The table
shows the case where g is chosen to match f on these examples.

If we remain true to the notion of unknown target, we cannot exclude any
of f_1, ..., f_8 from being the true f. Now, we have a dilemma. The whole
purpose of learning f is to be able to predict the value of f on points that we
haven’t seen before. The quality of the learning will be determined by how
close our prediction is to the true value. Regardless of what g predicts on
the three points we haven’t seen before (those outside of D, denoted by red
question marks), it can agree or disagree with the target, depending on which
of f_1, ..., f_8 turns out to be the true target. It is easy to verify that any 3
bits that replace the red question marks are as good as any other 3 bits.
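
The enumeration argument is small enough to check by brute force. The sketch below assumes a hypothetical data set D of five labeled points (the labels are illustrative, with 1 and 0 standing for the two output symbols); it lists all Boolean functions on three inputs, keeps those that agree with D, and tallies how each possible prediction on the three unseen points fares against the surviving targets.

```python
from itertools import product
from collections import Counter

inputs = list(product([0, 1], repeat=3))          # the 2^3 = 8 points of X

# Hypothetical data set D (illustrative labels only):
# 1 stands for one output symbol, 0 for the other.
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}
unseen = [x for x in inputs if x not in D]         # the 3 points outside D

# Every Boolean function on X is one assignment of 8 output bits.
all_functions = [dict(zip(inputs, bits)) for bits in product([0, 1], repeat=8)]
targets = [f for f in all_functions if all(f[x] == y for x, y in D.items())]
print(len(all_functions), len(targets))            # 256 functions, 8 agree with D

# For any prediction g on the unseen points, tally how many candidate
# targets agree with g on 3, 2, 1 and 0 of those points.
tallies = set()
for g_bits in product([0, 1], repeat=3):
    g = dict(zip(unseen, g_bits))
    tally = Counter(sum(f[x] == g[x] for x in unseen) for f in targets)
    tallies.add(tuple(tally[k] for k in (3, 2, 1, 0)))
print(tallies)                                     # one tally shared by every g
```

Because the set `tallies` ends up with a single element, no choice of the three bits does better than any other when all eight targets are treated as equally possible.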


Exercise 1.7 


For each of the following learning scenarios in the above problem, evaluate
the performance of g on the three points in X outside D. To measure the
performance, compute how many of the 8 possible target functions agree
with g on all three points, on two of them, on one of them, and on none
of them.


(a) H has only two hypotheses, one that always returns ‘●’ and one that
always returns ‘○’. The learning algorithm picks the hypothesis that
matches the data set the most.


(b) The same H, but the learning algorithm now picks the hypothesis
that matches the data set the least.


(c) H = {XOR} (only one hypothesis which is always picked), where
XOR is defined by XOR(x) = ● if the number of 1’s in x is odd and
XOR(x) = ○ if the number is even.


(d) H contains all possible hypotheses (all Boolean functions on three
variables), and the learning algorithm picks the hypothesis that agrees
with all training examples, but otherwise disagrees the most with the
XOR.


[Figure labels: ν = fraction of red marbles; μ = probability of red marbles]

Figure 1.8: A random sample is picked from a bin of red and green marbles.
The probability μ of red marbles in the bin is unknown. What does the
fraction ν of red marbles in the sample tell us about μ?


It doesn’t matter what the algorithm does or what hypothesis set H is used. 
Whether H has a hypothesis that perfectly agrees with D (as depicted in the
table) or not, and whether the learning algorithm picks that hypothesis or 
picks another one that disagrees with D (different green bits), it makes no 
difference whatsoever as far as the performance outside of D is concerned. Yet 
the performance outside D is all that matters in learning! 

This dilemma is not restricted to Boolean functions, but extends to the 
general learning problem. As long as f is an unknown function, knowing D 
cannot exclude any pattern of values for f outside of D. Therefore, the pre- 
dictions of g outside of D are meaningless. 

Does this mean that learning from data is doomed? If so, this will be a 
very short book ☺. Fortunately, learning is alive and well, and we will see
why. We won’t have to change our basic assumption to do that. The target 
function will continue to be unknown, and we still mean unknown. 


1.3.2 Probability to the Rescue 


We will show that we can indeed infer something outside D using only D, but 
in a probabilistic way. What we infer may not be much compared to learning 
a full target function, but it will establish the principle that we can reach 
outside D. Once we establish that, we will take it to the general learning 
problem and pin down what we can and cannot learn. 

Let’s take the simplest case of picking a sample, and see when we can say
something about the objects outside the sample. Consider a bin that contains
red and green marbles, possibly infinitely many. The proportion of red and
green marbles in the bin is such that if we pick a marble at random, the
probability that it will be red is μ and the probability that it will be green
is 1 − μ. We assume that the value of μ is unknown to us.

We pick a random sample of N independent marbles (with replacement)
from this bin, and observe the fraction ν of red marbles within the sample
(Figure 1.8). What does the value of ν tell us about the value of μ?

One answer is that regardless of the colors of the N marbles that we picked, 
we still don’t know the color of any marble that we didn’t pick. We can get 
mostly green marbles in the sample while the bin has mostly red marbles. 
Although this is certainly possible, it is by no means probable. 


Exercise 1.8 
If μ = 0.9, what is the probability that a sample of 10 marbles will have
ν ≤ 0.1? [Hints: 1. Use binomial distribution. 2. The answer is a very
small number.]


The situation is similar to taking a poll. A random sample from a population
tends to agree with the views of the population at large. The probability
distribution of the random variable ν in terms of the parameter μ is well
understood, and when the sample size is big, ν tends to be close to μ.

To quantify the relationship between ν and μ, we use a simple bound called
the Hoeffding Inequality. It states that for any sample size N,

\[
\mathbb{P}\left[\,|\nu - \mu| > \epsilon\,\right] \le 2e^{-2\epsilon^2 N} \quad \text{for any } \epsilon > 0. \tag{1.4}
\]

Here, P[·] denotes the probability of an event, in this case with respect to
the random sample we pick, and ε is any positive value we choose. Putting
Inequality (1.4) in words, it says that as the sample size N grows, it becomes
exponentially unlikely that ν will deviate from μ by more than our ‘tolerance’ ε.

The only quantity that is random in (1.4) is ν, which depends on the random
sample. By contrast, μ is not random. It is just a constant, albeit unknown to
us. There is a subtle point here. The utility of (1.4) is to infer the value of μ
using the value of ν, although it is μ that affects ν, not vice versa. However,
since the effect is that ν tends to be close to μ, we infer that μ ‘tends’ to be
close to ν.

Although P[|ν − μ| > ε] depends on μ, as μ appears in the argument and
also affects the distribution of ν, we are able to bound the probability by 2e^(−2ε²N),
which does not depend on μ. Notice that only the size N of the sample affects
the bound, not the size of the bin. The bin can be large or small, finite or
infinite, and we still get the same bound when we use the same sample size.
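
A quick simulation can make the bound tangible. The sketch below is only an illustration under assumed values (μ = 0.5, ε = 0.1, 2000 repetitions, a fixed random seed); it estimates the probability of a large deviation for a few sample sizes and prints it next to the Hoeffding bound.

```python
import math
import random

def deviation_probability(mu, n, eps, trials, rng):
    """Estimate P[|nu - mu| > eps] by drawing `trials` samples of n marbles."""
    bad = 0
    for _ in range(trials):
        reds = sum(rng.random() < mu for _ in range(n))   # red with probability mu
        if abs(reds / n - mu) > eps:
            bad += 1
    return bad / trials

rng = random.Random(0)          # fixed seed so the run is repeatable
mu, eps, trials = 0.5, 0.1, 2000
results = {}
for n in (10, 50, 200):
    estimate = deviation_probability(mu, n, eps, trials, rng)
    bound = 2 * math.exp(-2 * eps**2 * n)
    results[n] = (estimate, bound)
    print(f"N={n:4d}  estimate={estimate:.4f}  Hoeffding bound={bound:.4f}")
```

For small N the bound exceeds 1 and says nothing; it becomes informative only as N grows, and it is typically much looser than the estimated probability.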


Exercise 1.9 


If μ = 0.9, use the Hoeffding Inequality to bound the probability that a
sample of 10 marbles will have ν ≤ 0.1 and compare the answer to the
previous exercise.


If we choose ε to be very small in order to make ν a good approximation of μ,
we need a larger sample size N to make the RHS of Inequality (1.4) small. We
can then assert that it is likely that ν will indeed be a good approximation of μ.
Although this assertion does not give us the exact value of μ, and doesn’t even
guarantee that the approximate value holds, knowing that we are within ±ε
of μ most of the time is a significant improvement over not knowing anything
at all.
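
The trade-off between ε and N can be read directly off (1.4). Setting the right-hand side equal to a failure probability δ (a symbol introduced here for illustration, not used in the text) and solving for ε gives ε = sqrt(ln(2/δ) / (2N)). The sketch below, with the hypothetical helper `tolerance`, tabulates that value.

```python
import math

def tolerance(n, delta):
    """The eps at which the Hoeffding bound 2 * exp(-2 * eps**2 * n) equals delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

for n in (100, 1000, 10_000):
    print(n, round(tolerance(n, delta=0.05), 4))
```

Since ε shrinks like 1/sqrt(N), halving the tolerance at the same δ takes four times as many marbles.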

The fact that the sample was randomly selected from the bin is the reason 
we are able to make any kind of statement about μ being close to ν. If the
sample was not randomly selected but picked in a particular way, we would 
lose the benefit of the probabilistic analysis and we would again be in the dark 
outside of the sample. 

How does the bin model relate to the learning problem? It seems that the
unknown here was just the value of μ while the unknown in learning is an entire
function f: X → Y. The two situations can be connected. Take any single
hypothesis h ∈ H and compare it to f on each point x ∈ X. If h(x) = f(x),
color the point x green. If h(x) ≠ f(x), color the point x red. The color
that each point gets is not known to us, since f is unknown. However, if we
pick x at random according to some probability distribution P over the input
space X, we know that x will be red with some probability, call it μ, and green
with probability 1 − μ. Regardless of the value of μ, the space X now behaves
like the bin in Figure 1.8.

The training examples play the role of a sample from the bin. If the
inputs x_1, ..., x_N in D are picked independently according to P, we will get
a random sample of red (h(x_n) ≠ f(x_n)) and green (h(x_n) = f(x_n)) points.
Each point will be red with probability μ and green with probability 1 − μ. The
color of each point will be known to us since both h(x_n) and f(x_n) are known
for n = 1, ..., N (the function h is our hypothesis so we can evaluate it on
any point, and f(x_n) = y_n is given to us for all points in the data set D). The
learning problem is now reduced to a bin problem, under the assumption that
the inputs in D are picked independently according to some distribution P
on X. Any P will translate to some μ in the equivalent bin. Since μ is
allowed to be unknown, P can be unknown to us as well. Figure 1.9 adds this
probabilistic component to the basic learning setup depicted in Figure 1.2.

With this equivalence, the Hoeffding Inequality can be applied to the learning
problem, allowing us to make a prediction outside of D. Using ν to predict μ
tells us something about f, although it doesn’t tell us what f is. What μ
tells us is the error rate h makes in approximating f. If ν happens to be close
to zero, we can predict that h will approximate f well over the entire input
space. If not, we are out of luck.

Unfortunately, we have no control over ν in our current situation, since ν
is based on a particular hypothesis h. In real learning, we explore an entire
hypothesis set H, looking for some h ∈ H that has a small error rate. If we
have only one hypothesis to begin with, we are not really learning, but rather
‘verifying’ whether that particular hypothesis is good or bad. Let us see if we
can extend the bin equivalence to the case where we have multiple hypotheses
in order to capture real learning.

Figure 1.9: Probability added to the basic learning setup


To do that, we start by introducing more descriptive names for the different
components that we will use. The error rate within the sample, which
corresponds to ν in the bin model, will be called the in-sample error,

\[
E_{\text{in}}(h) = (\text{fraction of } \mathcal{D} \text{ where } f \text{ and } h \text{ disagree})
= \frac{1}{N} \sum_{n=1}^{N} [\![\, h(\mathbf{x}_n) \neq f(\mathbf{x}_n) \,]\!],
\]

where ⟦statement⟧ = 1 if the statement is true, and = 0 if the statement is
false. We have made explicit the dependency of Ein on the particular h that
we are considering. In the same way, we define the out-of-sample error

\[
E_{\text{out}}(h) = \mathbb{P}\left[\, h(\mathbf{x}) \neq f(\mathbf{x}) \,\right],
\]

which corresponds to μ in the bin model. The probability is based on the
distribution P over X which is used to sample the data points x.

Figure 1.10: Multiple bins depict the learning problem with M hypotheses


Substituting the new notation Ein for ν and Eout for μ, the Hoeffding
Inequality (1.4) can be rewritten as

\[
\mathbb{P}\left[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\right] \le 2e^{-2\epsilon^2 N} \quad \text{for any } \epsilon > 0, \tag{1.5}
\]

where N is the number of training examples. The in-sample error Ein, just
like ν, is a random variable that depends on the sample. The out-of-sample
error Eout, just like μ, is unknown but not random.
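
The two error definitions can be illustrated with a toy setup. Everything specific below is an assumption made for the sketch: the target `f` (a threshold at 0), the fixed hypothesis `h` (a threshold at 0.2), and the input distribution P (uniform on [-1, 1]). In a real problem f is unknown, so Eout could not be estimated this way.

```python
import random

def f(x):                      # hypothetical target, unknown in a real problem
    return +1 if x >= 0.0 else -1

def h(x):                      # one fixed hypothesis, chosen before seeing data
    return +1 if x >= 0.2 else -1

def in_sample_error(hyp, data):
    """Fraction of the examples (x_n, y_n) on which hyp(x_n) != y_n."""
    return sum(hyp(x) != y for x, y in data) / len(data)

rng = random.Random(1)
# Training set D: N = 100 inputs drawn independently from P = uniform on [-1, 1].
D = [(x, f(x)) for x in (rng.uniform(-1, 1) for _ in range(100))]
e_in = in_sample_error(h, D)

# E_out is a probability under P; it is estimated here from a large fresh
# sample, which is possible only because this sketch knows f.
fresh = [(x, f(x)) for x in (rng.uniform(-1, 1) for _ in range(50_000))]
e_out_estimate = in_sample_error(h, fresh)
print(e_in, e_out_estimate)
```

Under these assumptions h and f disagree exactly on the interval [0, 0.2), so Eout(h) is 0.1; rerunning with a different seed changes `e_in` but leaves Eout where it is.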

Let us consider an entire hypothesis set instead of just one hypothesis h, 
and assume for the moment that H has a finite number of hypotheses 


\[
\mathcal{H} = \{h_1, h_2, \dots, h_M\}.
\]


We can construct a bin equivalent in this case by having M bins as shown in
Figure 1.10. Each bin still represents the input space X, with the red marbles
in the mth bin corresponding to the points x ∈ X where h_m(x) ≠ f(x). The
probability of red marbles in the mth bin is Eout(h_m) and the fraction of
red marbles in the mth sample is Ein(h_m), for m = 1, ..., M. Although the
Hoeffding Inequality (1.5) still applies to each bin individually, the situation
becomes more complicated when we consider all the bins simultaneously. Why
is that? The inequality stated that

\[
\mathbb{P}\left[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\right] \le 2e^{-2\epsilon^2 N} \quad \text{for any } \epsilon > 0,
\]


where the hypothesis h is fixed before you generate the data set, and the 
probability is with respect to random data sets D; we emphasize that the 
assumption “h is fixed before you generate the data set” is critical to the 
validity of this bound. If you are allowed to change h after you generate the 
data set, the assumptions that are needed to prove the Hoeffding Inequality 
no longer hold. With multiple hypotheses in H, the learning algorithm picks
the final hypothesis g based on D, i.e. after generating the data set. The
statement we would like to make is not

“P[|Ein(h_m) − Eout(h_m)| > ε] is small”

(for any particular, fixed h_m ∈ H), but rather

“P[|Ein(g) − Eout(g)| > ε] is small” for the final hypothesis g.


The hypothesis g is not fixed ahead of time before generating the data, because 
which hypothesis is selected to be g depends on the data. So, we cannot just 
plug in g for h in the Hoeffding inequality. The next exercise considers a simple 
coin experiment that further illustrates the difference between a fixed h and 
the final hypothesis g selected by the learning algorithm. 


Exercise 1.10 


Here is an experiment that illustrates the difference between a single bin
and multiple bins. Run a computer simulation for flipping 1,000 fair coins.
Flip each coin independently 10 times. Let's focus on 3 coins as follows:
c_1 is the first coin flipped; c_rand is a coin you choose at random; c_min is the
coin that had the minimum frequency of heads (pick the earlier one in case
of a tie). Let ν_1, ν_rand and ν_min be the fraction of heads you obtain for the
respective three coins.


(a) What is μ for the three coins selected?


(b) Repeat this entire experiment a large number of times (e.g., 100,000
runs of the entire experiment) to get several instances of ν_1, ν_rand
and ν_min and plot the histograms of the distributions of ν_1, ν_rand and
ν_min. Notice that which coins end up being c_rand and c_min may differ
from one run to another.

(c) Using (b), plot estimates for P[|ν − μ| > ε] as a function of ε, together
with the Hoeffding bound 2e^(−2ε²N) (on the same graph).

(d) Which coins obey the Hoeffding bound, and which ones do not? Explain
why.

(e) Relate part (d) to the multiple bins in Figure 1.10.
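
One possible way to set up the simulation for Exercise 1.10 is sketched below. It is a starting point rather than a solution: it uses far fewer runs than the exercise suggests, prints frequencies instead of plotting histograms, and the function names, the seed, and the grid of ε values are arbitrary choices.

```python
import math
import random

def run_once(rng, coins=1000, flips=10):
    """One run of the experiment; returns (nu_1, nu_rand, nu_min)."""
    # getrandbits(flips) gives `flips` fair bits; the number of 1 bits is the head count.
    nu = [bin(rng.getrandbits(flips)).count("1") / flips for _ in range(coins)]
    # min() returns the smallest frequency; ties do not change its value.
    return nu[0], nu[rng.randrange(coins)], min(nu)

rng = random.Random(0)            # fixed seed, an arbitrary choice
runs = 2000                       # the exercise suggests 100,000; fewer keeps this quick
nu_1, nu_rand, nu_min = zip(*(run_once(rng) for _ in range(runs)))

def deviation_frequency(values, eps, mu=0.5):
    """Fraction of runs with |nu - mu| > eps (mu = 0.5 for fair coins)."""
    return sum(abs(v - mu) > eps for v in values) / len(values)

for eps in (0.05, 0.15, 0.25, 0.35, 0.45):
    bound = 2 * math.exp(-2 * eps**2 * 10)     # Hoeffding bound with N = 10 flips
    freqs = [deviation_frequency(v, eps) for v in (nu_1, nu_rand, nu_min)]
    print(f"eps={eps:.2f}  bound={bound:.3f}  c_1, c_rand, c_min: {freqs}")
```

The ε values sit halfway between the achievable frequencies 0, 0.1, ..., 1 so that the strict inequality is not affected by floating-point rounding.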


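The way to get around this is to try to bound P[|Ein(g) − Eout(g)| > ε] in
a way that does not depend on which g the learning algorithm picks. There
is a simple but crude way of doing that. Since g has to be one of the h_m’s
regardless of the algorithm and the sample, it is always true that

\[
\text{“}|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\text{”}
\;\Longrightarrow\;
\begin{array}[t]{l}
\text{“}|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| > \epsilon \\
\ \textbf{or}\ |E_{\text{in}}(h_2) - E_{\text{out}}(h_2)| > \epsilon \\
\ \cdots \\
\ \textbf{or}\ |E_{\text{in}}(h_M) - E_{\text{out}}(h_M)| > \epsilon\text{”},
\end{array}
\]

where B_1 ⟹ B_2 means that event B_1 implies event B_2. Although the events
on the RHS cover a lot more than the LHS, the RHS has the property we want;
the hypotheses h_m are fixed. We now apply two basic rules in probability;

\[
\text{if } \mathcal{B}_1 \Longrightarrow \mathcal{B}_2, \text{ then } \mathbb{P}[\mathcal{B}_1] \le \mathbb{P}[\mathcal{B}_2],
\]

and, if B_1, B_2, ..., B_M are any events, then

\[
\mathbb{P}[\mathcal{B}_1 \textbf{ or } \mathcal{B}_2 \textbf{ or } \cdots \textbf{ or } \mathcal{B}_M] \le \mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \cdots + \mathbb{P}[\mathcal{B}_M].
\]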


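The second rule is known as the union bound. Putting the two rules together,
we get

\[
\begin{aligned}
\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right]
&\le \mathbb{P}\big[\,|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| > \epsilon
\ \textbf{or}\ |E_{\text{in}}(h_2) - E_{\text{out}}(h_2)| > \epsilon
\ \cdots\ \textbf{or}\ |E_{\text{in}}(h_M) - E_{\text{out}}(h_M)| > \epsilon\,\big] \\
&\le \sum_{m=1}^{M} \mathbb{P}\left[\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\right].
\end{aligned}
\]

Applying the Hoeffding Inequality (1.5) to the M terms one at a time, we can
bound each term in the sum by 2e^(−2ε²N). Substituting, we get

\[
\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2Me^{-2\epsilon^2 N}. \tag{1.6}
\]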


Mathematically, this is a ‘uniform’ version of (1.5). We are trying to
simultaneously approximate all Eout(h_m)’s by the corresponding Ein(h_m)’s. This
allows the learning algorithm to choose any hypothesis based on Ein and expect
that the corresponding Eout will uniformly follow suit, regardless of which
hypothesis is chosen.

The downside for uniform estimates is that the probability bound 2Me^(−2ε²N)
is a factor of M looser than the bound for a single hypothesis, and will only
be meaningful if M is finite. We will improve on that in Chapter 2.
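
To get a feel for the cost of the factor M, one can solve 2Me^(−2ε²N) ≤ δ for N, which gives N ≥ ln(2M/δ) / (2ε²). The sketch below evaluates this for a few values of M; the failure probability δ and both helper names are assumptions introduced for the illustration.

```python
import math

def union_bound(m, n, eps):
    """Right-hand side of (1.6): 2 * M * exp(-2 * eps^2 * N)."""
    return 2 * m * math.exp(-2 * eps**2 * n)

def sample_size_for(m, eps, delta):
    """Smallest N that makes the bound in (1.6) at most delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * eps**2))

for m in (1, 100, 10_000):
    n = sample_size_for(m, eps=0.1, delta=0.05)
    print(m, n, union_bound(m, n, eps=0.1))
```

The required N grows only logarithmically in M, so the bound stays usable for large finite hypothesis sets, but it gives no information at all when M is infinite.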


1.3.3 Feasibility of Learning 


We have introduced two apparently conflicting arguments about the feasibility 
of learning. One argument says that we cannot learn anything outside of D, 
and the other says that we can. We would like to reconcile these two arguments 
and pinpoint the sense in which learning is feasible: 


1. Let us reconcile the two arguments. The question of whether D tells us 
anything outside of D that we didn’t know before has two different answers. 
If we insist on a deterministic answer, which means that D tells us something 
certain about f outside of D, then the answer is no. If we accept a probabilistic 
answer, which means that D tells us something likely about f outside of D, 
then the answer is yes.

Exercise 1.11


We are given a data set D of 25 training examples from an unknown target
function f: X → Y, where X = ℝ and Y = {-1, +1}. To learn f, we use
a simple hypothesis set H = {h_1, h_2} where h_1 is the constant +1 function
and h_2 is the constant -1.


We consider two learning algorithms, S (smart) and C (crazy). S chooses
the hypothesis that agrees the most with D and C chooses the other hypothesis
deliberately. Let us see how these algorithms perform out of sample
from the deterministic and probabilistic points of view. Assume in
the probabilistic view that there is a probability distribution on X, and let
P[f(x) = +1] = p.

(a) Can S produce a hypothesis that is guaranteed to perform better than 

random on any point outside D? 


(b) Assume for the rest of the exercise that all the examples in D have 
Yn = +1. Is it possible that the hypothesis that C produces turns out 
to be better than the hypothesis that S produces? 


(c) If p = 0.9, what is the probability that S will produce a better hy- 
pothesis than C? 


(d) Is there any value of p for which it is more likely than not that C will 
produce a better hypothesis than S? 


By adopting the probabilistic view, we get a positive answer to the feasibility 
question without paying too much of a price. The only assumption we make 
in the probabilistic framework is that the examples in D are generated inde- 
pendently. We don’t insist on using any particular probability distribution, 
or even on knowing what distribution is used. However, whatever distribu- 
tion we use for generating the examples, we must also use when we evaluate 
how well g approximates f (Figure 1.9). That’s what makes the Hoeffding 
Inequality applicable. Of course this ideal situation may not always happen 
in practice, and some variations of it have been explored in the literature. 


2. Let us pin down what we mean by the feasibility of learning. Learning produces
a hypothesis g to approximate the unknown target function f. If learning
is successful, then g should approximate f well, which means Eout(g) ≈ 0.
However, this is not what we get from the probabilistic analysis. What we
get instead is Eout(g) ≈ Ein(g). We still have to make Ein(g) ≈ 0 in order to
conclude that Eout(g) ≈ 0.

We cannot guarantee that we will find a hypothesis that achieves Ein(g) ≈ 0,
but at least we will know if we find it. Remember that Eout(g) is an unknown
quantity, since f is unknown, but Ein(g) is a quantity that we can evaluate.
We have thus traded the condition Eout(g) ≈ 0, one that we cannot ascertain,
for the condition Ein(g) ≈ 0, which we can ascertain. What enabled this is
the Hoeffding Inequality (1.6):

\[
\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2Me^{-2\epsilon^2 N}
\]

that assures us that Eout(g) ≈ Ein(g) so we can use Ein as a proxy for Eout.


Exercise 1.12 
A friend comes to you with a learning problem. She says the target func- 
tion f is completely unknown, but she has 4,000 data points. She is 
willing to pay you to solve her problem and produce for her a g which 
approximates f. What is the best that you can promise her among the 
following: 
(a) After learning you will provide her with a g that you will guarantee 
approximates f well out of sample. 
(b) After learning you will provide her with a g, and with high probability 
the g which you produce will approximate f well out of sample. 


(c) One of two things will happen. 


(i) You will produce a hypothesis g; 
(ii) You will declare that you failed. 


If you do return a hypothesis g, then with high probability the g which 
you produce will approximate f well out of sample. 


One should note that there are cases where we won’t insist that Ein(g) ≈ 0.
Financial forecasting is an example where market unpredictability makes it
impossible to get a forecast that has anywhere near zero error. All we hope
for is a forecast that gets it right more often than not. If we get that, our
bets will win in the long run. This means that a hypothesis that has Ein(g)
somewhat below 0.5 will work, provided of course that Eout(g) is close enough
to Ein(g).


The feasibility of learning is thus split into two questions: 


1. Can we make sure that Eout(g) is close enough to Ein(g)?


2. Can we make Ein(g) small enough?


The Hoeffding Inequality (1.6) addresses the first question only. The second
question is answered after we run the learning algorithm on the actual data
and see how small we can get Ein to be.

Breaking down the feasibility of learning into these two questions provides 
further insight into the role that different components of the learning problem 
play. One such insight has to do with the ‘complexity’ of these components. 


The complexity of H. If the number of hypotheses M goes up, we run
more risk that Ein(g) will be a poor estimator of Eout(g) according to
Inequality (1.6). M can be thought of as a measure of the ‘complexity’ of the
hypothesis set H that we use. If we want an affirmative answer to the first
question, we need to keep the complexity of H in check. However, if we want
an affirmative answer to the second question, we stand a better chance if H
is more complex, since g has to come from H. So, a more complex H gives us
more flexibility in finding some g that fits the data well, leading to small Ein(g).
This tradeoff in the complexity of H is a major theme in learning theory that
we will study in detail in Chapter 2.


The complexity of f. Intuitively, a complex target function f should be 
harder to learn than a simple f. Let us examine if this can be inferred from 
the two questions above. A close look at Inequality (1.6) reveals that the
complexity of f does not affect how well Ein(g) approximates Eout(g). If
we fix the hypothesis set and the number of training examples, the inequality 
provides the same bound whether we are trying to learn a simple f (for instance 
a constant function) or a complex f (for instance a highly nonlinear function). 
However, this doesn’t mean that we can learn complex functions as easily as 
we learn simple functions. Remember that (1.6) affects the first question only. 
If the target function is complex, the second question comes into play since 
the data from a complex f are harder to fit than the data from a simple f. 
This means that we will get a worse value for Ein(g) when f is complex. We
might try to get around that by making our hypothesis set more complex so
that we can fit the data better and get a lower Ein(g), but then Eout won’t be
as close to Ein per (1.6). Either way we look at it, a complex f is harder to
learn as we expected. In the extreme case, if f is too complex, we may not be 
able to learn it at all. 


Fortunately, most target functions in real life are not too complex; we can 
learn them from a reasonable D using a reasonable H. This is obviously a 
practical observation, not a mathematical statement. Even when we cannot 
learn a particular f, we will at least be able to tell that we can’t. As long as 
we make sure that the complexity of H gives us a good Hoeffding bound, our 
success or failure in learning f can be determined by our success or failure in 
fitting the training data. 


1.4 Error and Noise 


We close this chapter by revisiting two notions in the learning problem in order 
to bring them closer to the real world. The first notion is what approximation 
means when we say that our hypothesis approximates the target function 
well. The second notion is about the nature of the target function. In many 
situations, there is noise that makes the output of f not uniquely determined 
by the input. What are the ramifications of having such a ‘noisy’ target on 
the learning problem?


1.4.1 Error Measures


Learning is not expected to replicate the target function perfectly. The final 
hypothesis g is only an approximation of f. To quantify how well g approximates
f, we need to define an error measure³ that quantifies how far we are
from the target. 

The choice of an error measure affects the outcome of the learning process. 
Different error measures may lead to different choices of the final hypothesis, 
even if the target and the data are the same, since the value of a particular error 
measure may be small while the value of another error measure in the same 
situation is large. Therefore, which error measure we use has consequences 
for what we learn. What are the criteria for choosing one error measure over 
another? We address this question here. 

First, let’s formalize this notion a bit. An error measure quantifies how 
well each hypothesis h in the model approximates the target function f,


Error = E(h, f). 


While E(h, f) is based on the entirety of h and f, it is almost universally defined
based on the errors on individual input points x. If we define a pointwise
error measure e(h(x), f(x)), the overall error will be the average value of this
pointwise error. So far, we have been working with the classification error
$e(h(\mathbf{x}), f(\mathbf{x})) = [\![\, h(\mathbf{x}) \neq f(\mathbf{x}) \,]\!]$.

In an ideal world, E(h, f) should be user-specified. The same learning task
in different contexts may warrant the use of different error measures. One may 
view E(h, f) as the ‘cost’ of using h when you should use f. This cost depends 
on what h is used for, and cannot be dictated just by our learning techniques. 
Here is a case in point. 


Example 1.1 (Fingerprint verification). Consider the problem of verifying 
that a fingerprint belongs to a particular person. What is the appropriate 
error measure? 


[Figure: the target f maps a fingerprint to +1 (you) or -1 (intruder).]





The target function takes as input a fingerprint, and returns +1 if it belongs
to the right person, and -1 if it belongs to an intruder.


³ This measure is also called an error function in the literature, and sometimes the error
is referred to as cost, objective, or risk.
There are two types of error that our hypothesis h can make here. If the
correct person is rejected (h = -1 but f = +1), it is called false reject, and if
an incorrect person is accepted (h = +1 but f = -1), it is called false accept.

                      f
                +1             -1
    h    +1   no error       false accept
         -1   false reject   no error


How should the error measure be defined in this problem? If the right person 
is accepted or an intruder is rejected, the error is clearly zero. We need to 
specify the error values for a false accept and for a false reject. The right 
values depend on the application. 

Consider two potential clients of this fingerprint system. One is a super- 
market who will use it at the checkout counter to verify that you are a member 
of a discount program. The other is the CIA who will use it at the entrance 
to a secure facility to verify that you are authorized to enter that facility. 

For the supermarket, a false reject is costly because if a customer gets 
wrongly rejected, she may be discouraged from patronizing the supermarket 
in the future. All future revenue from this annoyed customer is lost. On the 
other hand, the cost of a false accept is minor. You just gave away a discount 
to someone who didn’t deserve it, and that person left their fingerprint in your 
system — they must be bold indeed. 

For the CIA, a false accept is a disaster. An unauthorized person will gain 
access to a highly sensitive facility. This should be reflected in a much higher 
cost for the false accept. False rejects, on the other hand, can be tolerated 
since authorized persons are employees (rather than customers as with the 
supermarket). The inconvenience of retrying when rejected is just part of the 
job, and they must deal with it. 

The costs of the different types of errors can be tabulated in a matrix. For 
our examples, the matrices might look like: 





[Cost matrices for the two clients: Supermarket and CIA]


These matrices should be used to weight the different types of errors when 
we compute the total error. When the learning algorithm minimizes a cost- 
weighted error measure, it automatically takes into consideration the utility 
of the hypothesis that it will produce. In the supermarket and CIA scenarios, 
this could lead to two completely different final hypotheses. □
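
As a rough illustration of how a cost matrix can enter the error computation, the following Python sketch averages a pointwise cost over a small sample. The function name `weighted_error`, the toy outputs, and the numeric costs are all hypothetical choices made for this illustration; they are assumptions, not values taken from the text.

```python
def weighted_error(h_out, f_out, cost_false_accept, cost_false_reject):
    """Average cost of hypothesis outputs h_out against targets f_out (+1/-1)."""
    total = 0.0
    for h, f in zip(h_out, f_out):
        if h == +1 and f == -1:        # intruder accepted
            total += cost_false_accept
        elif h == -1 and f == +1:      # right person rejected
            total += cost_false_reject
        # matching outputs contribute zero cost
    return total / len(f_out)

f_out = [+1, +1, -1, -1, +1]
h_a = [+1, -1, -1, -1, +1]             # makes one false reject
h_b = [+1, +1, +1, -1, +1]             # makes one false accept

# Hypothetical costs: one client penalizes false rejects more,
# the other penalizes false accepts far more.
supermarket = dict(cost_false_accept=1, cost_false_reject=10)
security = dict(cost_false_accept=1000, cost_false_reject=1)

for name, costs in [("supermarket", supermarket), ("security", security)]:
    print(name,
          weighted_error(h_a, f_out, **costs),
          weighted_error(h_b, f_out, **costs))
```

Under these assumed costs, the hypothesis that looks better for one client looks worse for the other, which is the point of the example: the same two hypotheses are ranked differently once the error measure changes.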


Figure 1.11: The general (supervised) learning problem


The moral of this example is that the choice of the error measure depends
on how the system is going to be used, rather than on any inherent criterion
that we can independently determine during the learning process. However,
this ideal choice may not be possible in practice for two reasons. One is 
that the user may not provide an error specification, which is not uncommon. 
The other is that the weighted cost may be a difficult objective function for 
optimizers to work with. Therefore, we often look for other ways to define the 
error measure, sometimes with purely practical or analytic considerations in 
mind. We have already seen an example of this with the simple binary error 
used in this chapter, and we will see other error measures in later chapters. 


1.4.2 Noisy Targets 


In many practical applications, the data we learn from are not generated by 
a deterministic target function. Instead, they are generated in a noisy way 
such that the output is not uniquely determined by the input. For instance, 
in the credit-card example we presented in Section 1.1, two customers may 
have identical salaries, outstanding loans, etc., but end up with different credit 
behavior. Therefore, the credit ‘function’ is not really a deterministic function, 
but a noisy one.

This situation can be readily modeled within the same framework that we 
have. Instead of y = f(x), we can take the output y to be a random variable 
that is affected by, rather than determined by, the input x. Formally, we have 
a target distribution P(y | x) instead of a target function y = f(x). A data 
point (x, y) is now generated by the joint distribution P(x, y) = P(x)P(y | x). 

One can think of a noisy target as a deterministic target plus added noise. 
If y is real-valued for example, one can take the expected value of y given x to 
be the deterministic f(x), and consider y — f(x) as pure noise that is added 
to f. 

This view suggests that a deterministic target function can be considered 
a special case of a noisy target, just with zero noise. Indeed, we can formally 
express any function f as a distribution P(y | x) by choosing P(y | x) to be 
zero for all y except y = f(x). Therefore, there is no loss of generality if we 
consider the target to be a distribution rather than a function. Figure 1.11 
modifies the previous Figures 1.2 and 1.9 to illustrate the general learning 
problem, covering both deterministic and noisy targets. 
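
The view of a noisy target as a deterministic part plus noise can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the linear form of `f`, the Gaussian noise model, and the name `sample_y` are hypothetical and do not come from the text.

```python
import random

def f(x):
    """Hypothetical deterministic part: the expected value of y given x."""
    return 2.0 * x + 1.0

def sample_y(x, noise_std, rng):
    """Draw y from P(y | x), modeled here as f(x) plus zero-mean Gaussian noise."""
    return f(x) + rng.gauss(0.0, noise_std)

rng = random.Random(0)
x = 0.5

# Averaging many noisy outputs at a fixed x approaches f(x).
ys = [sample_y(x, noise_std=0.3, rng=rng) for _ in range(20000)]
mean_y = sum(ys) / len(ys)
print(f"f(x) = {f(x):.3f}, average of noisy y = {mean_y:.3f}")

# With zero noise, P(y | x) puts all of its mass on y = f(x):
# the deterministic target is recovered as a special case.
deterministic_y = sample_y(x, noise_std=0.0, rng=rng)
print(deterministic_y == f(x))
```

The same sampling routine covers both cases, which mirrors the remark that nothing is lost by treating the target as a distribution.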


Exercise 1.13 


Consider the bin model for a hypothesis h that makes an error with probability
$\mu$ in approximating a deterministic target function f (both h and f
are binary functions). If we use the same h to approximate a noisy version
of f given by

$$P(y \mid \mathbf{x}) = \begin{cases} \lambda & y = f(\mathbf{x}), \\ 1 - \lambda & y \neq f(\mathbf{x}), \end{cases}$$

(a) What is the probability of error that h makes in approximating y?

(b) At what value of $\lambda$ will the performance of h be independent of $\mu$?
[Hint: The noisy target will look completely random.]


There is a difference between the role of P(y | x) and the role of P(x) in 
the learning problem. While both distributions model probabilistic aspects 
of x and y, the target distribution P(y | x) is what we are trying to learn, 
while the input distribution P(x) only quantifies the relative importance of 
the point x in gauging how well we have learned. 

Our entire analysis of the feasibility of learning applies to noisy target 
functions as well. Intuitively, this is because the Hoeffding Inequality (1.6) 
applies to an arbitrary, unknown target function. Assume we randomly picked 
all the y’s according to the distribution P(y | x) over the entire input space X.
This realization of P(y | x) is effectively a target function. Therefore, the 
inequality will be valid no matter which particular random realization the 
‘target function’ happens to be. 

This does not mean that learning a noisy target is as easy as learning a 
deterministic one. Remember the two questions of learning? With the same 
learning model, Eout may be as close to Ein in the noisy case as it is in the
deterministic case, but Ein itself will likely be worse in the noisy case since it
is hard to fit the noise. 

In Chapter 2, where we prove a stronger version of (1.6), we will assume 
the target to be a probability distribution P(y | x), thus covering the general
case.


1.5 Problems


Problem 1.1 We have 2 opaque bags, each containing 2 balls. One bag 
has 2 black balls and the other has a black and a white ball. You pick a bag 
at random and then pick one of the balls in that bag at random. When you 
look at the ball it is black. You now pick the second ball from that same bag. 
What is the probability that this ball is also black? [Hint: Use Bayes’ Theorem: 
P[A and B] = P[A | B] P [B] = P[B | A] P [A].] 


Problem 1.2 Consider the perceptron in two dimensions: $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$
where $\mathbf{w} = [w_0, w_1, w_2]^T$ and $\mathbf{x} = [1, x_1, x_2]^T$. Technically, x has
three coordinates, but we call this perceptron two-dimensional because the first
coordinate is fixed at 1.


(a) Show that the regions on the plane where h(x) = +1 and h(x) = -1 are
separated by a line. If we express this line by the equation $x_2 = a x_1 + b$,
what are the slope a and intercept b in terms of $w_0, w_1, w_2$?


(b) Draw a picture for the cases $\mathbf{w} = [1, 2, 3]^T$ and $\mathbf{w} = -[1, 2, 3]^T$.


In more than two dimensions, the +1 and -1 regions are separated by a hyperplane,
the generalization of a line.


Problem 1.3 Prove that the PLA eventually converges to a linear 
separator for separable data. The following steps will guide you through the 
proof. Let w* be an optimal set of weights (one which separates the data). 
The essential idea in this proof is to show that the PLA weights w(t) get “more 
aligned” with w* with every iteration. For simplicity, assume that w(0) = 0. 


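(a) Let $\rho = \min_{1 \le n \le N} y_n(\mathbf{w}^{*T}\mathbf{x}_n)$. Show that $\rho > 0$.


(b) Show that $\mathbf{w}^T(t)\mathbf{w}^* \ge \mathbf{w}^T(t-1)\mathbf{w}^* + \rho$, and conclude that $\mathbf{w}^T(t)\mathbf{w}^* \ge t\rho$.
[Hint: Use induction.]


(c) Show that $\|\mathbf{w}(t)\|^2 \le \|\mathbf{w}(t-1)\|^2 + \|\mathbf{x}(t-1)\|^2$.
[Hint: $y(t-1) \cdot (\mathbf{w}^T(t-1)\mathbf{x}(t-1)) \le 0$ because x(t - 1) was misclassified
by w(t - 1).]


(d) Show by induction that $\|\mathbf{w}(t)\|^2 \le tR^2$, where $R = \max_{1 \le n \le N} \|\mathbf{x}_n\|$.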


(e) Using (b) and (d), show that

$$\frac{\mathbf{w}^T(t)}{\|\mathbf{w}(t)\|}\mathbf{w}^* \ge \sqrt{t} \cdot \frac{\rho}{R},$$

and hence prove that

$$t \le \frac{R^2\|\mathbf{w}^*\|^2}{\rho^2}.$$

[Hint: $\frac{\mathbf{w}^T(t)\mathbf{w}^*}{\|\mathbf{w}(t)\|\,\|\mathbf{w}^*\|} \le 1$. Why?]


In practice, PLA converges more quickly than the bound $\frac{R^2\|\mathbf{w}^*\|^2}{\rho^2}$ suggests.
Nevertheless, because we do not know $\rho$ in advance, we can’t determine the
number of iterations to convergence, which does pose a problem if the data is
non-separable.
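
The relationship between the number of PLA updates and the bound can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the separating weights `w_star`, the margin filter of 0.05, the data-set size, and the rule of always updating on the first misclassified point are hypothetical choices, not part of the problem statement.

```python
import random

def pla(X, y, max_updates=100000):
    """Perceptron learning algorithm starting from w(0) = 0.
    X holds points with a leading 1 coordinate; returns (w, number of updates)."""
    w = [0.0] * len(X[0])
    updates = 0
    while updates < max_updates:
        # A point counts as misclassified when y * (w . x) <= 0.
        mis = [(x, t) for x, t in zip(X, y)
               if t * sum(wi * xi for wi, xi in zip(w, x)) <= 0]
        if not mis:
            break
        x, t = mis[0]
        w = [wi + t * xi for wi, xi in zip(w, x)]
        updates += 1
    return w, updates

rng = random.Random(1)
w_star = [0.1, 1.0, -1.0]              # hypothetical separating weights
X, y = [], []
while len(X) < 50:
    x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
    s = sum(wi * xi for wi, xi in zip(w_star, x))
    if abs(s) > 0.05:                  # keep a margin so that rho is not tiny
        X.append(x)
        y.append(1 if s > 0 else -1)

w, updates = pla(X, y)
rho = min(t * sum(wi * xi for wi, xi in zip(w_star, x)) for x, t in zip(X, y))
R2 = max(sum(xi * xi for xi in x) for x in X)
bound = R2 * sum(wi * wi for wi in w_star) / rho ** 2
print(f"updates = {updates}, bound = {bound:.1f}")
```

Note that the bound can only be evaluated here because the data were generated from a known `w_star`; with real data, $\rho$ is not available in advance, as the remark above points out.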


Problem 1.4 In Exercise 1.4, we use an artificial data set to study the 
perceptron learning algorithm. This problem leads you to explore the algorithm 
further with data sets of different sizes and dimensions. 


(a) Generate a linearly separable data set of size 20 as indicated in Exercise 1.4.
Plot the examples $\{(\mathbf{x}_n, y_n)\}$ as well as the target function f on
a plane. Be sure to mark the examples from different classes differently,
and add labels to the axes of the plot.

(b) Run the perceptron learning algorithm on the data set above. Report the
number of updates that the algorithm takes before converging. Plot the
examples $\{(\mathbf{x}_n, y_n)\}$, the target function f, and the final hypothesis g in
the same figure. Comment on whether f is close to g.

(c) Repeat everything in (b) with another randomly generated data set of
size 20. Compare your results with (b).

(d) Repeat everything in (b) with another randomly generated data set of
size 100. Compare your results with (b).

(e) Repeat everything in (b) with another randomly generated data set of
size 1,000. Compare your results with (b).

(f) Modify the algorithm such that it takes $\mathbf{x}_n \in \mathbb{R}^{10}$ instead of $\mathbb{R}^2$.
Randomly generate a linearly separable data set of size 1,000 with $\mathbf{x}_n \in \mathbb{R}^{10}$
and feed the data set to the algorithm. How many updates does the
algorithm take to converge?

(g) Repeat the algorithm on the same data set as (f) for 100 experiments. In
the iterations of each experiment, pick x(t) randomly instead of deterministically.
Plot a histogram for the number of updates that the algorithm
takes to converge.

(h) Summarize your conclusions with respect to accuracy and running time
as a function of N and d.


Problem 1.5 The perceptron learning algorithm works like this: In each iteration
t, pick a random (x(t), y(t)) and compute the ‘signal’ $s(t) = \mathbf{w}^T(t)\mathbf{x}(t)$.
If $y(t) \cdot s(t) \le 0$, update w by

$$\mathbf{w}(t+1) \leftarrow \mathbf{w}(t) + y(t) \cdot \mathbf{x}(t).$$

One may argue that this algorithm does not take the ‘closeness’ between s(t)
and y(t) into consideration. Let’s look at another perceptron learning algorithm:
In each iteration, pick a random (x(t), y(t)) and compute s(t). If
$y(t) \cdot s(t) \le 1$, update w by

$$\mathbf{w}(t+1) \leftarrow \mathbf{w}(t) + \eta \cdot (y(t) - s(t)) \cdot \mathbf{x}(t),$$

where $\eta$ is a constant. That is, if s(t) agrees with y(t) well (their product
is > 1), the algorithm does nothing. On the other hand, if s(t) is further
from y(t), the algorithm changes w(t) more. In this problem, you are asked to
implement this algorithm and study its performance.


(a) Generate a training data set of size 100 similar to that used in Exercise 1.4.
Generate a test data set of size 10,000 from the same process. To get g,
run the algorithm above with $\eta = 100$ on the training data set, until
a maximum of 1,000 updates has been reached. Plot the training data
set, the target function f, and the final hypothesis g on the same figure.
Report the error on the test set.

(b) Use the data set in (a) and redo everything with $\eta = 1$.

(c) Use the data set in (a) and redo everything with $\eta = 0.01$.

(d) Use the data set in (a) and redo everything with $\eta = 0.0001$.


(e) Compare the results that you get from (a) to (d). 


The algorithm above is a variant of the so-called Adaline (Adaptive Linear 
Neuron) algorithm for perceptron learning. 
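
A single step of the update rule described in this problem can be sketched as follows. This is a minimal illustration, not a full solution: the function name `adaline_step` and the toy inputs are hypothetical, and the data generation, the update loop, and the plots requested above are left out.

```python
def adaline_step(w, x, y, eta):
    """One step of the variant above: update only when y * s <= 1."""
    s = sum(wi * xi for wi, xi in zip(w, x))      # signal s(t) = w^T(t) x(t)
    if y * s <= 1:
        # The correction is proportional to how far s is from y.
        return [wi + eta * (y - s) * xi for wi, xi in zip(w, x)]
    return list(w)                                # s agrees well with y: no change

# Toy inputs (assumed for illustration); x carries a leading 1 coordinate.
w1 = adaline_step([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], +1, eta=0.1)
w2 = adaline_step([2.0, 0.0, 0.0], [1.0, 0.0, 0.0], +1, eta=0.1)
print(w1)   # weights moved toward y * x
print(w2)   # unchanged, since y * s = 2 > 1
```

Because the step size scales with y - s, a large $\eta$ can overshoot badly, which is one of the effects the problem asks you to observe.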


Problem 1.6 Consider a sample of 10 marbles drawn independently from
a bin that holds red and green marbles. The probability of a red marble is $\mu$.
For $\mu = 0.05$, $\mu = 0.5$, and $\mu = 0.8$, compute the probability of getting no red
marbles ($\nu = 0$) in the following cases.


(a) We draw only one such sample. Compute the probability that $\nu = 0$.


(b) We draw 1,000 independent samples. Compute the probability that (at
least) one of the samples has $\nu = 0$.


(c) Repeat (b) for 1,000,000 independent samples.


Problem 1.7 A sample of heads and tails is created by tossing a coin
a number of times independently. Assume we have a number of coins that
generate different samples independently. For a given coin, let the probability
of heads (probability of error) be $\mu$. The probability of obtaining k heads in N
tosses of this coin is given by the binomial distribution:

$$P[k \mid N, \mu] = \binom{N}{k}\mu^k(1-\mu)^{N-k}.$$

Remember that the training error $\nu$ is $k/N$.


(a) Assume the sample size (N) is 10. If all the coins have $\mu = 0.05$ compute
the probability that at least one coin will have $\nu = 0$ for the case of 1
coin, 1,000 coins, 1,000,000 coins. Repeat for $\mu = 0.8$.


(b) For the case N = 6 and 2 coins with $\mu = 0.5$ for both coins, plot the
probability

$$P[\max_i |\nu_i - \mu_i| > \epsilon]$$

for $\epsilon$ in the range [0, 1] (the max is over coins). On the same plot show the
bound that would be obtained using the Hoeffding Inequality. Remember
that for a single coin, the Hoeffding bound is

$$P[|\nu - \mu| > \epsilon] \le 2e^{-2N\epsilon^2}.$$

[Hint: Use P[A or B] = P[A] + P[B] - P[A and B] = P[A] + P[B] - P[A]P[B],
where the last equality follows by independence, to evaluate
$P[\max \ldots]$.]


Problem 1.8 The Hoeffding Inequality is one form of the law of large
numbers. One of the simplest forms of that law is the Chebyshev Inequality,
which you will prove here.


(a) If t is a non-negative random variable, prove that for any $\alpha > 0$,
$P[t \ge \alpha] \le E(t)/\alpha$.

(b) If u is any random variable with mean $\mu$ and variance $\sigma^2$, prove that for
any $\alpha > 0$, $P[(u - \mu)^2 \ge \alpha] \le \frac{\sigma^2}{\alpha}$. [Hint: Use (a)]


(c) If $u_1, \ldots, u_N$ are iid random variables, each with mean $\mu$ and variance $\sigma^2$,
and $u = \frac{1}{N}\sum_{n=1}^{N} u_n$, prove that for any $\alpha > 0$,

$$P[(u - \mu)^2 \ge \alpha] \le \frac{\sigma^2}{N\alpha}.$$


Notice that the RHS of this Chebyshev Inequality goes down linearly in N, 
while the counterpart in Hoeffding’s Inequality goes down exponentially. In 
Problem 1.9, we develop an exponential bound using a similar approach. 
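
To see the linear versus exponential decay numerically, the Python sketch below tabulates both bounds for a fair coin. It assumes $\sigma^2 = 1/4$ (the variance of a single fair coin flip) and sets $\alpha = \epsilon^2$ in part (c) so that both bounds refer to a deviation of size $\epsilon$; the function names are hypothetical.

```python
import math

def chebyshev_bound(N, eps, var=0.25):
    """Bound from part (c) with alpha = eps**2: sigma^2 / (N * eps^2)."""
    return var / (N * eps ** 2)

def hoeffding_bound(N, eps):
    """Hoeffding bound for a single coin: 2 * exp(-2 * eps^2 * N)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * N)

eps = 0.1
for N in (10, 100, 1000, 10000):
    print(f"N = {N:6d}  Chebyshev = {chebyshev_bound(N, eps):.3e}  "
          f"Hoeffding = {hoeffding_bound(N, eps):.3e}")
```

For small N both bounds exceed 1 and say nothing; as N grows, the Hoeffding bound falls far below the Chebyshev bound.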


Problem 1.9 In this problem, we derive a form of the law of large numbers
that has an exponential bound, called the Chernoff bound. We focus on the
simple case of flipping a fair coin, and use an approach similar to Problem 1.8.


(a) Let t be a (finite) random variable, $\alpha$ be a positive constant, and s be a
positive parameter. If $T(s) = E_t(e^{st})$, prove that

$$P[t \ge \alpha] \le e^{-s\alpha}\,T(s).$$

[Hint: $e^{st}$ is monotonically increasing in t.]


(b) Let $u_1, \ldots, u_N$ be iid random variables, and let $u = \frac{1}{N}\sum_{n=1}^{N} u_n$. If
$U(s) = E_{u_n}(e^{s u_n})$ (for any n), prove that

$$P[u \ge \alpha] \le \left(e^{-s\alpha}\,U(s)\right)^N.$$

(c) Suppose $P[u_n = 0] = P[u_n = 1] = \frac{1}{2}$ (fair coin). Evaluate U(s) as
a function of s, and minimize $e^{-s\alpha}U(s)$ with respect to s for fixed $\alpha$,
$0 < \alpha < 1$.


(d) Conclude in (c) that, for $0 < \epsilon < \frac{1}{2}$,

$$P[u \ge E(u) + \epsilon] \le 2^{-\beta N},$$

where $\beta = 1 + (\frac{1}{2} + \epsilon)\log_2(\frac{1}{2} + \epsilon) + (\frac{1}{2} - \epsilon)\log_2(\frac{1}{2} - \epsilon)$ and $E(u) = \frac{1}{2}$.
Show that $\beta > 0$, hence the bound is exponentially decreasing in N.


Problem 1.10 Assume that $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N, \mathbf{x}_{N+1}, \ldots, \mathbf{x}_{N+M}\}$
and $\mathcal{Y} = \{-1, +1\}$ with an unknown target function $f: \mathcal{X} \to \mathcal{Y}$. The training
data set D is $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$. Define the off-training-set error of a
hypothesis h with respect to f by

$$E_{\text{off}}(h, f) = \frac{1}{M}\sum_{m=1}^{M} [\![\, h(\mathbf{x}_{N+m}) \neq f(\mathbf{x}_{N+m}) \,]\!].$$


(a) Say f(x) = +1 for all x and

$$h(\mathbf{x}) = \begin{cases} +1, & \text{for } \mathbf{x} = \mathbf{x}_k \text{ and } k \text{ is odd and } 1 \le k \le M+N, \\ -1, & \text{otherwise.} \end{cases}$$

What is $E_{\text{off}}(h, f)$?

(b) We say that a target function f can ‘generate’ D in a noiseless setting
if $y_n = f(\mathbf{x}_n)$ for all $(\mathbf{x}_n, y_n) \in \mathcal{D}$. For a fixed D of size N, how many
possible $f: \mathcal{X} \to \mathcal{Y}$ can generate D in a noiseless setting?

(c) For a given hypothesis h and an integer k between 0 and M, how many
of those f in (b) satisfy $E_{\text{off}}(h, f) = \frac{k}{M}$?

(d) For a given hypothesis h, if all those f that generate D in a noiseless
setting are equally likely in probability, what is the expected off-training-set
error $E_f[E_{\text{off}}(h, f)]$?

(e) A deterministic algorithm A is defined as a procedure that takes D as
an input, and outputs a hypothesis h = A(D). Argue that for any two
deterministic algorithms $A_1$ and $A_2$,

$$E_f[E_{\text{off}}(A_1(\mathcal{D}), f)] = E_f[E_{\text{off}}(A_2(\mathcal{D}), f)].$$


You have now proved that in a noiseless setting, for a fixed D, if all possible f 
are equally likely, any two deterministic algorithms are equivalent in terms of the 
expected off-training-set error. Similar results can be proved for more general 
settings. 


Problem 1.11 The matrix which tabulates the cost of various errors for 
the CIA and Supermarket applications in Example 1.1 is called a risk or loss 
matrix. 

For the two risk matrices in Example 1.1, explicitly write down the in-sample
error Ein that one should minimize to obtain g. This in-sample error should
weight the different types of errors based on the risk matrix. [Hint: Consider
$y_n = +1$ and $y_n = -1$ separately.]


Problem 1.12 This problem investigates how changing the error measure
can change the result of the learning process. You have N data points $y_1 \le \cdots \le y_N$
and wish to estimate a ‘representative’ value.


(a) If your algorithm is to find the hypothesis h that minimizes the in-sample
sum of squared deviations,

$$E_{\text{in}}(h) = \sum_{n=1}^{N} (h - y_n)^2,$$

then show that your estimate will be the in-sample mean,

$$h_{\text{mean}} = \frac{1}{N}\sum_{n=1}^{N} y_n.$$


(b) If your algorithm is to find the hypothesis h that minimizes the in-sample
sum of absolute deviations,

$$E_{\text{in}}(h) = \sum_{n=1}^{N} |h - y_n|,$$

then show that your estimate will be the in-sample median $h_{\text{med}}$, which
is any value for which half the data points are at most $h_{\text{med}}$ and half the
data points are at least $h_{\text{med}}$.


(c) Suppose $y_N$ is perturbed to $y_N + \epsilon$, where $\epsilon \to \infty$. So, the single data
point $y_N$ becomes an outlier. What happens to your two estimators $h_{\text{mean}}$
and $h_{\text{med}}$?
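
A small numerical experiment can make the contrast between the two error measures concrete before attempting the proofs. The Python sketch below uses a hypothetical five-point data set, a coarse grid search over candidate values of h, and a large but finite perturbation of $10^6$ standing in for $\epsilon \to \infty$; none of these choices come from the problem itself.

```python
import statistics

def sum_squared(h, ys):
    """In-sample sum of squared deviations."""
    return sum((h - y) ** 2 for y in ys)

def sum_absolute(h, ys):
    """In-sample sum of absolute deviations."""
    return sum(abs(h - y) for y in ys)

ys = [1.0, 2.0, 3.0, 4.0, 5.0]                  # hypothetical data points
candidates = [i / 10 for i in range(0, 61)]     # grid of h values from 0.0 to 6.0

best_sq = min(candidates, key=lambda h: sum_squared(h, ys))
best_abs = min(candidates, key=lambda h: sum_absolute(h, ys))
print(best_sq, statistics.mean(ys))             # squared error favors the mean
print(best_abs, statistics.median(ys))          # absolute error favors the median

# Perturb the largest point by a large amount to mimic an outlier.
ys_outlier = ys[:-1] + [ys[-1] + 1e6]
mean_out = statistics.mean(ys_outlier)
median_out = statistics.median(ys_outlier)
print(mean_out, median_out)
```

The grid search is only a sanity check on one data set; it does not replace the arguments asked for in (a) and (b).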


Chapter 2


Training versus Testing 


Before the final exam, a professor may hand out some practice problems and 
solutions to the class. Although these problems are not the exact ones that 
will appear on the exam, studying them will help you do better. They are the 
‘training set’ in your learning. 

If the professor’s goal is to help you do better in the exam, why not give 
out the exam problems themselves? Well, nice try :-). Doing well in the
exam is not the goal in and of itself. The goal is for you to learn the course 
material. The exam is merely a way to gauge how well you have learned the 
material. If the exam problems are known ahead of time, your performance 
on them will no longer accurately gauge how well you have learned. 

The same distinction between training and testing happens in learning from 
data. In this chapter, we will develop a mathematical theory that characterizes 
this distinction. We will also discuss the conceptual and practical implications 
of the contrast between training and testing. 


2.1 Theory of Generalization 


The out-of-sample error Eout measures how well our training on D has generalized
to data that we have not seen before. Eout is based on the performance
over the entire input space X. Intuitively, if we want to estimate the value
of Eout using a sample of data points, these points must be ‘fresh’ test points
that have not been used for training, similar to the questions on the final exam 
that have not been used for practice. 

The in-sample error Ein, by contrast, is based on data points that have
been used for training. It expressly measures training performance, similar to 
your performance on the practice problems that you got before the final exam. 
Such performance has the benefit of looking at the solutions and adjusting 
accordingly, and may not reflect the ultimate performance in a real test. We 
began the analysis of in-sample error in Chapter 1, and we will extend this 
analysis to the general case in this chapter. We will also make the contrast
between a training set and a test set more precise. 

A word of warning: this chapter is the heaviest in this book in terms of 
mathematical abstraction. To make it easier on the not-so-mathematically 
inclined, we will tell you which part you can safely skip without ‘losing the 
plot’. The mathematical results provide fundamental insights into learning 
from data, and we will interpret these results in practical terms. 


Generalization error. We have already discussed how the value of Ein
does not always generalize to a similar value of Eout. Generalization is a key
issue in learning. One can define the generalization error as the discrepancy
between Ein and Eout.† The Hoeffding Inequality (1.6) provides a way to
characterize the generalization error with a probabilistic bound,

$$P[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,] \le 2Me^{-2\epsilon^2 N}$$

for any $\epsilon > 0$. This can be rephrased as follows. Pick a tolerance level $\delta$, for
example $\delta = 0.05$, and assert with probability at least $1 - \delta$ that

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}. \qquad (2.1)$$

We refer to the type of inequality in (2.1) as a generalization bound because
it bounds Eout in terms of Ein. To see that the Hoeffding Inequality implies
this generalization bound, we rewrite (1.6) as follows: with probability at least
$1 - 2Me^{-2N\epsilon^2}$, $|E_{\text{out}} - E_{\text{in}}| \le \epsilon$, which implies $E_{\text{out}} \le E_{\text{in}} + \epsilon$. We may now
identify $\delta = 2Me^{-2N\epsilon^2}$, from which $\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$, and (2.1) follows.

Notice that the other side of $|E_{\text{out}} - E_{\text{in}}| \le \epsilon$ also holds, that is, $E_{\text{out}} \ge E_{\text{in}} - \epsilon$
for all $h \in \mathcal{H}$. This is important for learning, but in a more subtle way.
Not only do we want to know that the hypothesis g that we choose (say the
one with the best training error) will continue to do well out of sample (i.e.,
$E_{\text{out}} \le E_{\text{in}} + \epsilon$), but we also want to be sure that we did the best we could
with our H (no other hypothesis $h \in \mathcal{H}$ has $E_{\text{out}}(h)$ significantly better than
$E_{\text{out}}(g)$). The $E_{\text{out}}(h) \ge E_{\text{in}}(h) - \epsilon$ direction of the bound assures us that
we couldn’t do much better because every hypothesis with a higher Ein than
the g we have chosen will have a comparably higher Eout.


The error bound $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ in (2.1), or ‘error bar’ if you will, depends
on M, the size of the hypothesis set H. If H is an infinite set, the bound goes 
to infinity and becomes meaningless. Unfortunately, almost all interesting 
learning models have infinite H, including the simple perceptron which we 
discussed in Chapter 1. 
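
The dependence of the error bar on M is easy to tabulate. The Python sketch below evaluates the square-root term of (2.1); the function name `error_bar` and the particular values of N, M, and the tolerance are assumptions chosen for illustration.

```python
import math

def error_bar(N, M, delta):
    """Square-root term of the generalization bound (2.1):
    sqrt( ln(2M / delta) / (2N) )."""
    return math.sqrt(math.log(2.0 * M / delta) / (2.0 * N))

N, delta = 1000, 0.05
for M in (1, 100, 10 ** 6, 10 ** 12):
    print(f"M = {M:>14d}  error bar = {error_bar(N, M, delta):.4f}")
```

The error bar grows only logarithmically in M, yet it still increases without limit, which is why a finite replacement for M is needed when H is infinite.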

† Sometimes ‘generalization error’ is used as another name for Eout, but not in this book.

In order to study generalization in such models, we need to derive a counterpart
to (2.1) that deals with infinite H. We would like to replace M with
something finite, so that the bound is meaningful. To do this, we notice that
the way we got the M factor in the first place was by taking the disjunction
of events:


“$|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| > \epsilon$” or
“$|E_{\text{in}}(h_2) - E_{\text{out}}(h_2)| > \epsilon$” or
...
“$|E_{\text{in}}(h_M) - E_{\text{out}}(h_M)| > \epsilon$”, (2.2)


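which is guaranteed to include the event “$|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon$” since g is always
one of the hypotheses in H. We then over-estimated the probability using
the union bound. Let $\mathcal{B}_m$ be the (Bad) event that “$|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$”.
Then,

$$P[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \cdots \text{ or } \mathcal{B}_M] \le P[\mathcal{B}_1] + P[\mathcal{B}_2] + \cdots + P[\mathcal{B}_M].$$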


If the events $\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_M$ are strongly
overlapping, the union bound becomes particularly
loose as illustrated in the figure to
the right for an example with 3 hypotheses;
the areas of different events correspond to
their probabilities. The union bound says
that the total area covered by $\mathcal{B}_1$, $\mathcal{B}_2$, or $\mathcal{B}_3$
is smaller than the sum of the individual areas,
which is true but is a gross overestimate
when the areas overlap heavily as in this example.
The events “$|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$”;
m = 1, ..., M, are often strongly overlapping.
If $h_1$ is very similar to $h_2$ for instance,
the two events “$|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| > \epsilon$” and “$|E_{\text{in}}(h_2) - E_{\text{out}}(h_2)| > \epsilon$” are
likely to coincide for most data sets. In a typical learning model, many hypotheses
are indeed very similar. If you take the perceptron model for instance,
as you slowly vary the weight vector w, you get infinitely many hypotheses
that differ from each other only infinitesimally.
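
The overlap of bad events for two similar hypotheses can be observed in a small simulation. The Python sketch below is an illustration under assumptions that are not part of the text: inputs uniform on [0, 1], a target sign(x - 0.5), two threshold hypotheses sign(x - 0.30) and sign(x - 0.31), N = 20, and $\epsilon = 0.1$.

```python
import random

def bad_event(threshold, xs, eps):
    """True if |E_in - E_out| > eps for h(x) = sign(x - threshold),
    with target sign(x - 0.5) and x uniform on [0, 1]."""
    e_out = abs(threshold - 0.5)                 # probability that h and f disagree
    lo, hi = sorted((threshold, 0.5))
    e_in = sum(lo <= x < hi for x in xs) / len(xs)   # sample fraction of disagreements
    return abs(e_in - e_out) > eps

rng = random.Random(0)
trials, N, eps = 2000, 20, 0.1
count1 = count2 = count_union = 0
for _ in range(trials):
    xs = [rng.random() for _ in range(N)]
    b1 = bad_event(0.30, xs, eps)
    b2 = bad_event(0.31, xs, eps)
    count1 += b1
    count2 += b2
    count_union += (b1 or b2)

print("P[B1] + P[B2] estimate:", (count1 + count2) / trials)
print("P[B1 or B2] estimate:  ", count_union / trials)
```

Because the two hypotheses disagree with the target on nearly the same region, most data sets that are bad for one are bad for the other, so the estimated probability of the union sits well below the sum of the two individual estimates.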

The mathematical theory of generalization hinges on this observation. 
Once we properly account for the overlaps of the different hypotheses, we 
will be able to replace the number of hypotheses M in (2.1) by an effective 
number which is finite even when M is infinite, and establish a more useful 
condition under which Eout is close to Ein.





2.1.1 Effective Number of Hypotheses 


We now introduce the growth function, the quantity that will formalize the 
effective number of hypotheses. The growth function is what will replace M 
in the generalization bound (2.1). It is a combinatorial quantity that captures
how different the hypotheses in H are, and hence how much overlap the
different events in (2.2) have. 

We will start by defining the growth function and studying its basic prop- 
erties. Next, we will show how we can bound the value of the growth function. 
Finally, we will show that we can replace M in the generalization bound with 
the growth function. These three steps will yield the generalization bound that 
we need, which applies to infinite H. We will focus on binary target functions 
for the purpose of this analysis, so each $h \in \mathcal{H}$ maps $\mathcal{X}$ to $\{-1, +1\}$.

The definition of the growth function is based on the number of different
hypotheses that H can implement, but only over a finite sample of points
rather than over the entire input space X. If $h \in \mathcal{H}$ is applied to a finite sample
$\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}$, we get an N-tuple $h(\mathbf{x}_1), \ldots, h(\mathbf{x}_N)$ of ±1’s. Such an N-tuple
is called a dichotomy since it splits $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into two groups: those points for
which h is -1 and those for which h is +1. Each $h \in \mathcal{H}$ generates a dichotomy
on $\mathbf{x}_1, \ldots, \mathbf{x}_N$, but two different h’s may generate the same dichotomy if they
happen to give the same pattern of ±1’s on this particular sample.


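Definition 2.1. Let $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}$. The dichotomies generated by H on
these points are defined by

$$\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \{\, (h(\mathbf{x}_1), \ldots, h(\mathbf{x}_N)) \mid h \in \mathcal{H} \,\}. \qquad (2.3)$$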


One can think of the dichotomies $\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ as a set of hypotheses just
like H is, except that the hypotheses are seen through the eyes of N points
only. A larger $\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ means H is more ‘diverse’, generating more
dichotomies on $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The growth function is based on the number of
dichotomies.


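Definition 2.2. The growth function is defined for a hypothesis set H by

$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|,$$

where $|\cdot|$ denotes the cardinality (number of elements) of a set.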


In words, $m_{\mathcal{H}}(N)$ is the maximum number of dichotomies that can be generated
by H on any N points. To compute $m_{\mathcal{H}}(N)$, we consider all possible
choices of N points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ from X and pick the one that gives us the
most dichotomies. Like M, $m_{\mathcal{H}}(N)$ is a measure of the number of hypotheses
in H, except that a hypothesis is now considered on N points instead of the
entire X. For any H, since $\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N) \subseteq \{-1, +1\}^N$ (the set of all possible
dichotomies on any N points), the value of $m_{\mathcal{H}}(N)$ is at most $|\{-1, +1\}^N|$,
hence

$$m_{\mathcal{H}}(N) \le 2^N.$$


If H is capable of generating all possible dichotomies on $\mathbf{x}_1, \ldots, \mathbf{x}_N$, then
$\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \{-1, +1\}^N$ and we say that H can shatter $\mathbf{x}_1, \ldots, \mathbf{x}_N$. This
signifies that H is as diverse as can be on this particular sample.


Figure 2.1: Illustration of the growth function for a two-dimensional per- 
ceptron. The dichotomy of red versus blue on the 3 colinear points in part 
(a) cannot be generated by a perceptron, but all 8 dichotomies on the 3 
points in part (b) can. By contrast, the dichotomy of red versus blue on 
the 4 points in part (c) cannot be generated by a perceptron. At most 14 
out of the possible 16 dichotomies on any 4 points can be generated. 


Example 2.1. If X is a Euclidean plane and H is a two-dimensional perceptron,
what are $m_{\mathcal{H}}(3)$ and $m_{\mathcal{H}}(4)$? Figure 2.1(a) shows a dichotomy on 3 points
that the perceptron cannot generate, while Figure 2.1(b) shows another 3
points that the perceptron can shatter, generating all $2^3 = 8$ dichotomies.
Because the definition of $m_{\mathcal{H}}(N)$ is based on the maximum number of dichotomies,
$m_{\mathcal{H}}(3) = 8$ in spite of the case in Figure 2.1(a).

In the case of 4 points, Figure 2.1(c) shows a dichotomy that the perceptron
cannot generate. One can verify that there are no 4 points that the perceptron
can shatter. The most a perceptron can do on any 4 points is 14 dichotomies
out of the possible 16, where the 2 missing dichotomies are as depicted in
Figure 2.1(c) with blue and red corresponding to -1, +1 or to +1, -1. Hence,
$m_{\mathcal{H}}(4) = 14$. □
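
The counts in this example can be checked by brute force on specific point sets. The Python sketch below enumerates every labeling of a few points and tests whether a two-dimensional perceptron can realize it by running PLA from zero weights. The point coordinates and the cap of 1,000 updates are assumptions: the cap is generous for these small configurations, and hitting it is treated as evidence that the labeling is not linearly separable.

```python
from itertools import product

def separable(points, labels, max_updates=1000):
    """Try to realize the labeling with a 2-D perceptron by running PLA from w = 0.
    Returns True if PLA converges within max_updates (an assumed, generous cap)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_updates):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] + w[1] * x1 + w[2] * x2) <= 0:    # misclassified point
                w = [w[0] + y, w[1] + y * x1, w[2] + y * x2]
                break
        else:
            return True          # a full pass found no misclassified point
    return False

def count_dichotomies(points):
    """Number of labelings of the points that the perceptron can generate."""
    return sum(separable(points, labels)
               for labels in product((-1, +1), repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]            # 3 points, not colinear
four = [(0, 0), (1, 0), (0, 1), (1, 1)]     # corners of a square
print(count_dichotomies(three), count_dichotomies(four))
```

This only counts dichotomies on the chosen points. The growth function is a maximum over all placements of N points, so the sketch illustrates that 8 and 14 are attained; it does not by itself show that no 4 points can do better.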


Let us now illustrate how to compute $m_{\mathcal{H}}(N)$ for some simple hypothesis
sets. These examples will confirm the intuition that $m_{\mathcal{H}}(N)$ grows faster
when the hypothesis set H becomes more complex. This is what we expect of 
a quantity that is meant to replace M in the generalization bound (2.1). 


Example 2.2. Let us find a formula for $m_{\mathcal{H}}(N)$ in each of the following cases.


1. Positive rays: H consists of all hypotheses $h\colon \mathbb{R} \to \{-1,+1\}$ of the form
$h(x) = \mathrm{sign}(x - a)$, i.e., the hypotheses are defined in a one-dimensional
input space, and they return $-1$ to the left of some value $a$ and $+1$ to
the right of $a$.





[Figure: the points $x_1, x_2, x_3, \dots, x_N$ on a line, with the threshold $a$ marked]
To compute $m_{\mathcal{H}}(N)$, we notice that given N points, the line is split by
the points into N + 1 regions. The dichotomy we get on the N points
is decided by which region contains the value $a$. As we vary $a$, we will
get N + 1 different dichotomies. Since this is the most we can get for
any N points, the growth function is


$$m_{\mathcal{H}}(N) = N + 1.$$


Notice that if we picked N points where some of the points coincided
(which is allowed), we would get fewer than N + 1 dichotomies. This does
not affect the value of $m_{\mathcal{H}}(N)$ since it is defined based on the maximum
number of dichotomies.


2. Positive intervals: H consists of all hypotheses in one dimension that
return $+1$ within some interval and $-1$ otherwise. Each hypothesis is
specified by the two end values of that interval.





To compute $m_{\mathcal{H}}(N)$, we notice that given N points, the line is again
split by the points into N + 1 regions. The dichotomy we get is decided
by which two regions contain the end values of the interval, resulting
in $\binom{N+1}{2}$ different dichotomies. If both end values fall in the same
region, the resulting hypothesis is the constant $-1$ regardless of which
region it is. Adding up these possibilities, we get


$$m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \tfrac{1}{2}N^2 + \tfrac{1}{2}N + 1.$$


Notice that $m_{\mathcal{H}}(N)$ grows as the square of N, faster than the linear
$m_{\mathcal{H}}(N)$ of the ‘simpler’ positive ray case.


3. Convex sets: H consists of all hypotheses in two dimensions
$h\colon \mathbb{R}^2 \to \{-1,+1\}$ that are positive inside some convex set and negative elsewhere
(a set is convex if the line segment connecting any two points in the set
lies entirely within the set).


To compute my(N) in this case, we need to choose the N points care- 
fully. Per the next figure, choose N points on the perimeter of a circle. 
Now consider any dichotomy on these points, assigning an arbitrary pat- 
tern of +1’s to the N points. If you connect the +1 points with a polygon, 
the hypothesis made up of the closed interior of the polygon (which has 
to be convex since its vertices are on the perimeter of a circle) agrees 
with the dichotomy on all N points. For the dichotomies that have fewer
than three +1 points, the convex set will be a line segment, a point, or
an empty set.
This means that any dichotomy on these N points can be realized using a
convex hypothesis, so H manages to shatter these points and the growth
function has the maximum possible value


$$m_{\mathcal{H}}(N) = 2^N.$$


Notice that if the N points were chosen at random in the plane rather
than on the perimeter of a circle, many of the points would be ‘internal’
and we wouldn’t be able to shatter all the points with convex hypotheses
as we did for the perimeter points. However, this doesn’t matter as far
as $m_{\mathcal{H}}(N)$ is concerned, since it is defined based on the maximum ($2^N$
in this case). □
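The first two formulas of Example 2.2 can be confirmed by brute force. The sketch below follows the counting argument in the text: it picks one representative value inside each of the N + 1 regions and enumerates the resulting hypotheses. The helper names (`region_representatives`, `positive_ray_dichotomies`, `positive_interval_dichotomies`) and the choice of integer test points are assumptions made for illustration.

```python
from math import comb

def region_representatives(xs):
    """One value inside each of the N + 1 regions cut out by N distinct points."""
    s = sorted(xs)
    return [s[0] - 1] + [(a + b) / 2 for a, b in zip(s, s[1:])] + [s[-1] + 1]

def positive_ray_dichotomies(xs):
    """Dichotomies of h(x) = sign(x - a), one threshold a per region."""
    return {tuple(+1 if x > a else -1 for x in xs)
            for a in region_representatives(xs)}

def positive_interval_dichotomies(xs):
    """Dichotomies of hypotheses that are +1 inside (lo, hi) and -1 elsewhere."""
    cuts = region_representatives(xs)
    return {tuple(+1 if lo < x < hi else -1 for x in xs)
            for lo in cuts for hi in cuts if lo <= hi}

for n in range(1, 8):
    xs = list(range(n))  # N distinct points on the line
    rays = len(positive_ray_dichotomies(xs))
    intervals = len(positive_interval_dichotomies(xs))
    # Columns: N, rays found, N + 1, intervals found, C(N+1, 2) + 1
    print(n, rays, n + 1, intervals, comb(n + 1, 2) + 1)
```

The convex-set case is not enumerated here, since the argument in the text (points on a circle) already exhibits all $2^N$ dichotomies directly.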


It is not practical to try to compute $m_{\mathcal{H}}(N)$ for every hypothesis set we use.
Fortunately, we don’t have to. Since $m_{\mathcal{H}}(N)$ is meant to replace M in (2.1),
we can use an upper bound on $m_{\mathcal{H}}(N)$ instead of the exact value, and the
inequality in (2.1) will still hold. Getting a good bound on $m_{\mathcal{H}}(N)$ will prove
much easier than computing $m_{\mathcal{H}}(N)$ itself, thanks to the notion of a break
point.


Definition 2.3. If no data set of size k can be shattered by H, then k is said 
to be a break point for H. 


If k is a break point, then $m_{\mathcal{H}}(k) < 2^k$. Example 2.1 shows that k = 4 is a
break point for two-dimensional perceptrons. In general, it is easier to find a
break point for H than to compute the full growth function for that H.


Exercise 2.1 


By inspection, find a break point k for each hypothesis set in Example 2.2
(if there is one). Verify that $m_{\mathcal{H}}(k) < 2^k$ using the formulas derived in
that Example.


We now use the break point k to derive a bound on the growth function $m_{\mathcal{H}}(N)$
for all values of N. For example, the fact that no 4 points can be shattered by
the two-dimensional perceptron puts a significant constraint on the number of
dichotomies that can be realized by the perceptron on 5 or more points. We
will exploit this idea to get a significant bound on $m_{\mathcal{H}}(N)$ in general.


2.1.2 Bounding the Growth Function 


The most important fact about growth functions is that if the condition
$m_{\mathcal{H}}(N) = 2^N$ breaks at any point, we can bound $m_{\mathcal{H}}(N)$ for all values of N
by a simple polynomial based on this break point. The fact that the bound
is polynomial is crucial. Absent a break point (as is the case in the convex
hypothesis example), $m_{\mathcal{H}}(N) = 2^N$ for all N. If $m_{\mathcal{H}}(N)$ replaced M in
Equation (2.1), the bound $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ on the generalization error would
not go to zero regardless of how many training examples N we have. However,
if $m_{\mathcal{H}}(N)$ can be bounded by a polynomial (any polynomial), the generalization
error will go to zero as $N \to \infty$. This means that we will generalize well given a
sufficient number of examples.
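A quick numerical sketch makes the contrast concrete. It evaluates the error bar $\sqrt{\frac{1}{2N}\ln\frac{2m}{\delta}}$ with $m = 2^N$ and with a polynomial $m$. The choices $\delta = 0.1$ and $m = N^3 + 1$ are illustrative assumptions, and the function `error_bar` is a hypothetical helper that takes $\ln m$ as input so that $2^N$ never has to be formed explicitly.

```python
import math

def error_bar(log_growth, n, delta=0.1):
    """sqrt( (1/(2N)) * ln(2 m / delta) ), with the growth function given as ln m."""
    return math.sqrt((math.log(2 / delta) + log_growth) / (2 * n))

for n in [10, 100, 1000, 10**4, 10**5, 10**6]:
    exponential = error_bar(n * math.log(2), n)    # m(N) = 2^N, no break point
    polynomial = error_bar(math.log(n**3 + 1), n)  # m(N) = N^3 + 1, assumed polynomial
    print(f"N={n:>8}  2^N: {exponential:.3f}   N^3+1: {polynomial:.4f}")
```

With $m = 2^N$ the bar levels off near $\sqrt{(\ln 2)/2}$ instead of shrinking, while with the polynomial it keeps decreasing as N grows.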





Safe skip: If you trust our math, you can skip
the following part without compromising the logical
sequence. A similar green box will tell you when to
rejoin.





To prove the polynomial bound, we will introduce a combinatorial quantity 
that counts the maximum number of dichotomies given that there is a break 
point, without having to assume any particular form of H. This bound will 
therefore apply to any H. 


Definition 2.4. B(N,k) is the maximum number of dichotomies on N points 
such that no subset of size k of the N points can be shattered by these di- 
chotomies. 


The definition of B(N,k) assumes a break point k, then tries to find the
most dichotomies on N points without imposing any further restrictions.
Since B(N, k) is defined as a maximum, it will serve as an upper bound for
any $m_{\mathcal{H}}(N)$ that has a break point k:


$$m_{\mathcal{H}}(N) \le B(N,k) \quad \text{if } k \text{ is a break point for } \mathcal{H}.$$


The notation B comes from ‘Binomial’ and the reason will become clear
shortly. To evaluate B(N, k), we start with the two boundary conditions
k = 1 and N = 1:


$$B(N,1) = 1,$$
$$B(1,k) = 2 \quad \text{for } k > 1.$$
B(N,1) = 1 for all N since if no subset of size 1 can be shattered, then only
one dichotomy can be allowed. A second different dichotomy must differ on at
least one point and then that subset of size 1 would be shattered. B(1,k) = 2
for k > 1 since in this case there do not even exist subsets of size k; the
constraint is vacuously true and we have 2 possible dichotomies ($+1$ and $-1$)
on the one point.
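For very small N, Definition 2.4 can be evaluated directly by exhaustive search, which gives a concrete check of the boundary conditions. The sketch below is not part of the text: `shatters` and `brute_force_B` are hypothetical helper names, and the search is exponential in $2^N$, so it is only meant for N up to about 3 or 4.

```python
from itertools import combinations, product

def shatters(dichotomies, subset):
    """True if the dichotomies realize all 2^|subset| patterns on the given points."""
    patterns = {tuple(d[i] for i in subset) for d in dichotomies}
    return len(patterns) == 2 ** len(subset)

def brute_force_B(n, k):
    """Largest set of dichotomies on n points that shatters no subset of size k."""
    all_dichotomies = list(product((-1, +1), repeat=n))
    subsets = list(combinations(range(n), k))  # empty when k > n
    for size in range(len(all_dichotomies), 0, -1):
        for chosen in combinations(all_dichotomies, size):
            if not any(shatters(chosen, s) for s in subsets):
                return size
    return 0

for n in range(1, 4):
    print(n, [brute_force_B(n, k) for k in range(1, 4)])
```

The first column of the output is B(N,1) and the first row is B(1,k), so both boundary conditions can be read off directly; the remaining entries can be compared with whatever bound one derives for B(N,k).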

We now assume $N \ge 2$ and $k \ge 2$ and try to develop a recursion. Consider
the B(N, k) dichotomies in Definition 2.4, where no k points can be shattered.
We list these dichotomies in the following table,








where $\mathbf{x}_1,\dots,\mathbf{x}_N$ in the table are labels for the N points of the dichotomy.
We have chosen a convenient order in which to list the dichotomies, as follows.
Consider the dichotomies on $\mathbf{x}_1,\dots,\mathbf{x}_{N-1}$. Some dichotomies on these $N-1$
points appear only once (with either $+1$ or $-1$ in the $\mathbf{x}_N$ column, but not
both). We collect these dichotomies in the set $S_1$. The remaining dichotomies
on the first $N-1$ points appear twice, once with $+1$ and once with $-1$ in
the $\mathbf{x}_N$ column. We collect these dichotomies in the set $S_2$, which can be
divided into two equal parts, $S_2^+$ and $S_2^-$ (with $+1$ and $-1$ in the $\mathbf{x}_N$ column,
respectively). Let $S_1$ have $\alpha$ rows, and let $S_2^+$ and $S_2^-$ have $\beta$ rows each. Since
the total number of rows in the table is B(N, k) by construction, we have


$$B(N,k) = \alpha + 2\beta. \tag{2.4}$$


The total number of different dichotomies on the first $N-1$ points is given
by $\alpha + \beta$; since $S_2^+$ and $S_2^-$ are identical on these $N-1$ points, their
dichotomies are redundant. Since no subset of k of these first $N-1$ points can
be shattered (since no k-subset of all N points can be shattered), we deduce
that

$$\alpha + \beta \le B(N-1, k) \tag{2.5}$$

by definition of B. Further, no subset of size $k-1$ of the first $N-1$ points can
be shattered by the dichotomies in $S_2^+$. If there existed such a subset, then
taking the corresponding set of dichotomies in $S_2^-$ and adding $\mathbf{x}_N$ to the data
points yields a subset of size k that is shattered, which we know cannot exist
in this table by definition of B(N,k). Therefore,

$$\beta \le B(N-1, k-1). \tag{2.6}$$

Substituting the two Inequalities (2.5) and (2.6) into (2.4), we get

$$B(N,k) \le B(N-1,k) + B(N-1,k-1). \tag{2.7}$$


We can use (2.7) to recursively compute a bound on B(N, k), as shown in the 
following table. 





                 k
         1    2    3    4    5    6   ...
    1    1    2    2    2    2    2   ...
    2    1    3    4    4    4    4   ...
    3    1    4    7    8    8    8   ...
N   4    1    5   11   ...
    5    1    6   ...
    6    1    7   ...
    :    :    :

where the first row (N = 1) and the first column (k = 1) are the bound- 
ary conditions that we already calculated. We can also use the recursion to 
bound B(N, k) analytically. 
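The table can be filled in mechanically. The following sketch applies the boundary conditions and then the recursion (2.7), treating the inequality as an equality so that each entry is an upper bound on B(N, k). The function name `recursion_bound` and the dictionary layout are illustrative assumptions.

```python
def recursion_bound(n_max, k_max):
    """Upper bounds on B(N, k) from the boundary conditions and recursion (2.7)."""
    table = {}
    for k in range(1, k_max + 1):
        table[(1, k)] = 1 if k == 1 else 2   # B(1,1) = 1 and B(1,k) = 2 for k > 1
    for n in range(2, n_max + 1):
        table[(n, 1)] = 1                    # B(N,1) = 1
        for k in range(2, k_max + 1):
            # Bound on B(N,k) from B(N-1,k) + B(N-1,k-1)
            table[(n, k)] = table[(n - 1, k)] + table[(n - 1, k - 1)]
    return table

table = recursion_bound(6, 6)
for n in range(1, 7):
    print(n, [table[(n, k)] for k in range(1, 7)])
```

Each printed row corresponds to one value of N, with columns k = 1 through 6.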


Lemma 2.3 (Sauer’s Lemma).

$$B(N,k) \le \sum_{i=0}^{k-1} \binom{N}{i}.$$


Proof. The statement is true whenever k = 1 or N = 1, by inspection. The
proof is by induction on N. Assume the statement is true for all $N \le N_o$
and all k. We need to prove the statement for $N = N_o + 1$ and all k. Since
the statement is already true when k = 1 (for all values of N) by the initial
condition, we only need to worry about $k \ge 2$. By (2.7),


$$B(N_o + 1, k) \le B(N_o, k) + B(N_o, k-1).$$
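Applying the induction hypothesis to each term on the RHS, we get

$$\begin{aligned}
B(N_o + 1, k) &\le \sum_{i=0}^{k-1} \binom{N_o}{i} + \sum_{i=0}^{k-2} \binom{N_o}{i} \\
&= 1 + \sum_{i=1}^{k-1} \binom{N_o}{i} + \sum_{i=1}^{k-1} \binom{N_o}{i-1} \\
&= 1 + \sum_{i=1}^{k-1} \left[ \binom{N_o}{i} + \binom{N_o}{i-1} \right] \\
&= 1 + \sum_{i=1}^{k-1} \binom{N_o + 1}{i} = \sum_{i=0}^{k-1} \binom{N_o + 1}{i},
\end{aligned}$$

where the combinatorial identity $\binom{N_o+1}{i} = \binom{N_o}{i} + \binom{N_o}{i-1}$ has been used.
This identity can be proved by noticing that to calculate the number of ways
to pick i objects from $N_o + 1$ distinct objects, either the first object is included,
in $\binom{N_o}{i-1}$ ways, or the first object is not included, in $\binom{N_o}{i}$ ways. We have
thus proved the induction step, so the statement is true for all N and k. ■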


It turns out that B(N, k) in fact equals $\sum_{i=0}^{k-1}\binom{N}{i}$ (see Problem 2.4), but
we only need the inequality of Lemma 2.3 to bound the growth function. For
a given break point k, the bound $\sum_{i=0}^{k-1}\binom{N}{i}$ is polynomial in N, as each term
in the sum is polynomial (of degree $i \le k-1$). Since B(N,k) is an upper
bound on any $m_{\mathcal{H}}(N)$ that has a break point k, we have proved the theorem
that follows.





End safe skip: Those who skipped are now rejoining
us. The next theorem states that any growth function
$m_{\mathcal{H}}(N)$ with a break point is bounded by a polynomial.





Theorem 2.4. If $m_{\mathcal{H}}(k) < 2^k$ for some value k, then

$$m_{\mathcal{H}}(N) \le \sum_{i=0}^{k-1} \binom{N}{i}$$

for all N. The RHS is polynomial in N of degree $k-1$.


The implication of Theorem 2.4 is that if H has a break point, we have
what we want to ensure good generalization: a polynomial bound on $m_{\mathcal{H}}(N)$.
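As a small illustration, the right-hand side of Theorem 2.4 can be evaluated for the two-dimensional perceptron, whose break point k = 4 was found in Example 2.1, and compared with $2^N$. The function name `polynomial_bound` and the particular values of N are illustrative choices, not part of the text.

```python
from math import comb

def polynomial_bound(n, k):
    """Right-hand side of Theorem 2.4: sum of C(n, i) for i = 0, ..., k - 1."""
    return sum(comb(n, i) for i in range(k))

k = 4  # break point of the two-dimensional perceptron (Example 2.1)
for n in [3, 4, 5, 10, 20, 50]:
    # Columns: N, polynomial bound, 2^N
    print(n, polynomial_bound(n, k), 2 ** n)
```

At N = 4 the bound gives 15, consistent with $m_{\mathcal{H}}(4) = 14$ from Example 2.1, and for larger N the polynomial falls far below $2^N$.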
Exercise 2.2

(a) Verify the bound of Theorem 2.4 in the three cases of Example 2.2:

(i) Positive rays: H consists of all hypotheses in one dimension of
the form $h(x) = \mathrm{sign}(x - a)$.

(ii) Positive intervals: H consists of all hypotheses in one dimension
that are positive within some interval and negative elsewhere.

(iii) Convex sets: H consists of all hypotheses in two dimensions that
are positive inside some convex set and negative elsewhere.

(Note: you can use the break points you found in Exercise 2.1.)


(b) Does there exist a hypothesis set for which $m_{\mathcal{H}}(N) = N + 2^{\lfloor N/2 \rfloor}$
(where $\lfloor N/2 \rfloor$ is the largest integer $\le N/2$)?


2.1.3 The VC Dimension 


Theorem 2.4 bounds the entire growth function in terms of any break point. 
The smaller the break point, the better the bound. This leads us to the fol- 
lowing definition of a single parameter that characterizes the growth function. 


Definition 2.5. The Vapnik-Chervonenkis dimension of a hypothesis set H,
denoted by $d_{\mathrm{vc}}(\mathcal{H})$ or simply $d_{\mathrm{vc}}$, is the largest value of N for which
$m_{\mathcal{H}}(N) = 2^N$. If $m_{\mathcal{H}}(N) = 2^N$ for all N, then $d_{\mathrm{vc}}(\mathcal{H}) = \infty$.

If $d_{\mathrm{vc}}$ is the VC dimension of H, then $k = d_{\mathrm{vc}} + 1$ is a break point for $m_{\mathcal{H}}$
since $m_{\mathcal{H}}(N)$ cannot equal $2^N$ for any $N > d_{\mathrm{vc}}$ by definition. It is easy to see
that no smaller break point exists since H can shatter $d_{\mathrm{vc}}$ points, hence it can
also shatter any subset of these points.


Exercise 2.3 


Compute the VC dimension of H for the hypothesis sets in parts (i), (ii), 
(iii) of Exercise 2.2(a). 


Since $k = d_{\mathrm{vc}} + 1$ is a break point for $m_{\mathcal{H}}$, Theorem 2.4 can be rewritten in
terms of the VC dimension:


$$m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{\mathrm{vc}}} \binom{N}{i}. \tag{2.9}$$


Therefore, the VC dimension is the order of the polynomial bound on $m_{\mathcal{H}}(N)$.
It is also the best we can do using this line of reasoning, because no smaller
break point than $k = d_{\mathrm{vc}} + 1$ exists. The form of the polynomial bound can
be further simplified to make the dependency on $d_{\mathrm{vc}}$ more salient. We state a
useful form here, which can be proved by induction (Problem 2.5).


$$m_{\mathcal{H}}(N) \le N^{d_{\mathrm{vc}}} + 1. \tag{2.10}$$
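The step from (2.9) to (2.10) relies on the binomial sum never exceeding $N^{d_{\mathrm{vc}}} + 1$. A numerical spot check over a small grid is sketched below; it is a sanity check rather than a proof, and the helper name `binomial_sum` and the grid ranges are arbitrary assumptions.

```python
from math import comb

def binomial_sum(n, d_vc):
    """Right-hand side of (2.9): sum of C(n, i) for i = 0, ..., d_vc."""
    return sum(comb(n, i) for i in range(d_vc + 1))

# Collect any (N, d_vc) pair on the grid where the sum exceeds N^d_vc + 1.
violations = [(n, d) for n in range(1, 41) for d in range(0, 11)
              if binomial_sum(n, d) > n ** d + 1]
print(violations)  # an empty list means (2.10) held everywhere on the grid
```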
Now that the growth function has been bounded in terms of the VC dimension,
we have only one more step left in our analysis, which is to replace the
number of hypotheses M in the generalization bound (2.1) with the growth
function $m_{\mathcal{H}}(N)$. If we manage to do that, the VC dimension will play a
pivotal role in the generalization question. If we were to directly replace M
by $m_{\mathcal{H}}(N)$ in (2.1), we would get a bound of the form


$$E_{\mathrm{out}} \overset{?}{\le} E_{\mathrm{in}} + \sqrt{\frac{1}{2N} \ln \frac{2 m_{\mathcal{H}}(N)}{\delta}}.$$



Unless $d_{\mathrm{vc}}(\mathcal{H}) = \infty$, we know that $m_{\mathcal{H}}(N)$ is bounded by a polynomial in N;
thus, $\ln m_{\mathcal{H}}(N)$ grows logarithmically in N regardless of the order of the
polynomial, and so it will be crushed by the $\frac{1}{N}$ factor. Therefore, for any fixed
tolerance $\delta$, the bound on $E_{\mathrm{out}}$ will be arbitrarily close to $E_{\mathrm{in}}$ for sufficiently
large N.

Only if $d_{\mathrm{vc}}(\mathcal{H}) = \infty$ will this argument fail, as the growth function in this
case is exponential in N. For any finite value of $d_{\mathrm{vc}}$, the error bar will converge
to zero at a speed determined by $d_{\mathrm{vc}}$, since $d_{\mathrm{vc}}$ is the order of the polynomial.
The smaller $d_{\mathrm{vc}}$ is, the faster the convergence to zero.

It turns out that we cannot just replace M with $m_{\mathcal{H}}(N)$ in the generalization
bound (2.1), but rather we need to make other adjustments as we will see
shortly. However, the general idea above is correct, and $d_{\mathrm{vc}}$ will still play the
role that we discussed here. One implication of this discussion is that there
is a division of models into two classes. The ‘good models’ have finite $d_{\mathrm{vc}}$,
and for sufficiently large N, $E_{\mathrm{in}}$ will be close to $E_{\mathrm{out}}$; for good models, the
in-sample performance generalizes to out of sample. The ‘bad models’ have
infinite $d_{\mathrm{vc}}$. With a bad model, no matter how large the data set is, we cannot
make generalization conclusions from $E_{\mathrm{in}}$ to $E_{\mathrm{out}}$ based on the VC analysis.²

Because of its significant role, it is worthwhile to try to gain some insight
about the VC dimension before we proceed to the formalities of deriving the
new generalization bound. One way to gain insight about $d_{\mathrm{vc}}$ is to try to
compute it for learning models that we are familiar with. Perceptrons are one
case where we can compute $d_{\mathrm{vc}}$ exactly. This is done in two steps. First,
we show that $d_{\mathrm{vc}}$ is at least a certain value, then we show that it is at most
the same value. There is a logical difference in arguing that $d_{\mathrm{vc}}$ is at least a
certain value, as opposed to at most a certain value. This is because


$$d_{\mathrm{vc}} \ge N \iff \text{there exists } \mathcal{D} \text{ of size } N \text{ such that } \mathcal{H} \text{ shatters } \mathcal{D},$$

hence we have different conclusions in the following cases.


1. There is a set of N points that can be shattered by H. In this case, we
can conclude that $d_{\mathrm{vc}} \ge N$.





²In some cases with infinite $d_{\mathrm{vc}}$, such as the convex sets that we discussed, alternative
analysis based on an ‘average’ growth function can establish good generalization behavior.
2. Any set of N points can be shattered by H. In this case, we have more
than enough information to conclude that $d_{\mathrm{vc}} \ge N$.


3. There is a set of N points that cannot be shattered by H. Based only on
this information, we cannot conclude anything about the value of $d_{\mathrm{vc}}$.


4. No set of N points can be shattered by H. In this case, we can conclude
that $d_{\mathrm{vc}} < N$.


Exercise 2.4 


Consider the input space $\mathcal{X} = \{1\} \times \mathbb{R}^d$ (including the constant coordinate
$x_0 = 1$). Show that the VC dimension of the perceptron (with d + 1
parameters, counting $w_0$) is exactly d + 1 by showing that it is at least d + 1
and at most d + 1, as follows.


(a) To show that $d_{\mathrm{vc}} \ge d + 1$, find d + 1 points in $\mathcal{X}$ that the perceptron
can shatter. [Hint: Construct a nonsingular (d + 1) × (d + 1) matrix
whose rows represent the d + 1 points, then use the nonsingularity to
argue that the perceptron can shatter these points.]


(b) To show that $d_{\mathrm{vc}} \le d + 1$, show that no set of d + 2 points in $\mathcal{X}$
can be shattered by the perceptron. [Hint: Represent each point
in $\mathcal{X}$ as a vector of length d + 1, then use the fact that any d + 2
vectors of length d + 1 have to be linearly dependent. This means
that some vector is a linear combination of all the other vectors.
Now, if you choose the class of these other vectors carefully, then the
classification of the dependent vector will be dictated. Conclude that
there is some dichotomy that cannot be implemented, and therefore
that for $N \ge d + 2$, $m_{\mathcal{H}}(N) < 2^N$.]


The VC dimension of a d-dimensional perceptron³ is indeed d + 1. This is
consistent with Figure 2.1 for the case d = 2, which shows a VC dimension
of 3. The perceptron case provides a nice intuition about the VC dimension,
since d + 1 is also the number of parameters in this model. One can view
the VC dimension as measuring the ‘effective’ number of parameters. The
more parameters a model has, the more diverse its hypothesis set is, which
is reflected in a larger value of the growth function $m_{\mathcal{H}}(N)$. In the case
of perceptrons, the effective parameters correspond to explicit parameters in
the model, namely $w_0, w_1, \dots, w_d$. In other models, the effective parameters
may be less obvious or implicit. The VC dimension measures these effective
parameters or ‘degrees of freedom’ that enable the model to express a diverse
set of hypotheses.

Diversity is not necessarily a good thing in the context of generalization.
For example, the set of all possible hypotheses is as diverse as can be, so
$m_{\mathcal{H}}(N) = 2^N$ for all N and $d_{\mathrm{vc}}(\mathcal{H}) = \infty$. In this case, no generalization at all
is to be expected, as the final version of the generalization bound will show.


³$\mathcal{X} = \{1\} \times \mathbb{R}^d$ is considered d-dimensional since the first coordinate $x_0 = 1$ is fixed.
2.1.4 The VC Generalization Bound


If we treated the growth function as an effective number of hypotheses, and
replaced M in the generalization bound (2.1) with $m_{\mathcal{H}}(N)$, the resulting bound
would be

$$E_{\mathrm{out}}(g) \overset{?}{\le} E_{\mathrm{in}}(g) + \sqrt{\frac{1}{2N} \ln \frac{2 m_{\mathcal{H}}(N)}{\delta}}. \tag{2.11}$$

It turns out that this is not exactly the form that will hold. The quantities in
red (the factor $\frac{1}{2N}$, the constant 2, and the argument N of the growth function)
need to be technically modified to make (2.11) true. The correct bound,
which is called the VC generalization bound, is given in the following theorem;
it holds for any binary target function f, any hypothesis set H, any learning
algorithm $\mathcal{A}$, and any input probability distribution P.


Theorem 2.5 (VC generalization bound). For any tolerance $\delta > 0$,


$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N} \ln \frac{4 m_{\mathcal{H}}(2N)}{\delta}} \tag{2.12}$$


with probability $\ge 1 - \delta$.


If you compare the blue items in (2.12) to their red counterparts in (2.11), you
notice that all the blue items move the bound in the weaker direction. However,
as long as the VC dimension is finite, the error bar still converges to zero
(albeit at a slower rate), since $m_{\mathcal{H}}(2N)$ is also polynomial of order $d_{\mathrm{vc}}$ in N,
just like $m_{\mathcal{H}}(N)$. This means that, with enough data, each and every hypothesis
in an infinite H with a finite VC dimension will generalize well from $E_{\mathrm{in}}$
to $E_{\mathrm{out}}$. The key is that the effective number of hypotheses, represented by
the finite growth function, has replaced the actual number of hypotheses in
the bound.

The VC generalization bound is the most important mathematical result 
in the theory of learning. It establishes the feasibility of learning with infinite 
hypothesis sets. Since the formal proof is somewhat lengthy and technical, we 
illustrate the main ideas in a sketch of the proof, and include the formal proof 
as an appendix. There are two parts to the proof; the justification that the 
growth function can replace the number of hypotheses in the first place, and 
the reason why we had to change the red items in (2.11) into the blue items 
in (2.12). 


Sketch of the proof. The data set D is the source of randomization in the
original Hoeffding Inequality. Consider the space of all possible data sets. Let
us think of this space as a ‘canvas’ (Figure 2.2(a)). Each D is a point on that
canvas. The probability of a point is determined by which $\mathbf{x}_n$’s in $\mathcal{X}$ happen to
be in that particular D, and is calculated based on the distribution P over $\mathcal{X}$.
Let’s think of probabilities of different events as areas on that canvas, so the
total area of the canvas is 1.
[Figure: three panels over the space of data sets: (a) Hoeffding Inequality, (b) Union Bound, (c) VC Bound]


Figure 2.2: Illustration of the proof of the VC bound, where the ‘canvas’
represents the space of all data sets, with areas corresponding to probabilities.
(a) For a given hypothesis, the colored points correspond to data sets
where $E_{\mathrm{in}}$ does not generalize well to $E_{\mathrm{out}}$. The Hoeffding Inequality
guarantees a small colored area. (b) For several hypotheses, the union bound
assumes no overlaps, so the total colored area is large. (c) The VC bound
keeps track of overlaps, so it estimates the total area of bad generalization
to be relatively small.


For a given hypothesis $h \in \mathcal{H}$, the event “$|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon$” consists
of all points D for which the statement is true. For a particular h, let us paint
all these ‘bad’ points using one color. What the basic Hoeffding Inequality
tells us is that the colored area on the canvas will be small (Figure 2.2(a)).

Now, if we take another $h \in \mathcal{H}$, the event “$|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon$” may
contain different points, since the event depends on h. Let us paint these points
with a different color. The area covered by all the points we colored will be 
at most the sum of the two individual areas, which is the case only if the two 
areas have no points in common. This is the worst case that the union bound 
considers. If we keep throwing in a new colored area for each h € H, and never 
overlap with previous colors, the canvas will soon be mostly covered in color 
(Figure 2.2(b)). Even if each h contributed very little, the sheer number of 
hypotheses will eventually make the colored area cover the whole canvas. This 
was the problem with using the union bound in the Hoeffding Inequality (1.6), 
and not taking the overlaps of the colored areas into consideration. 

The bulk of the VC proof deals with how to account for the overlaps. Here 
is the idea. If you were told that the hypotheses in H are such that each 
point on the canvas that is colored will be colored 100 times (because of 100 
different h’s), then the total colored area is now 1/100 of what it would have 
been if the colored points had not overlapped at all. This is the essence of
the VC bound as illustrated in Figure 2.2(c). The argument goes as follows.
Many hypotheses share the same dichotomy on a given D, since there are
finitely many dichotomies even with an infinite number of hypotheses. Any 
statement based on D alone will be simultaneously true or simultaneously 
false for all the hypotheses that look the same on that particular D. What 
the growth function enables us to do is to account for this kind of hypothesis 
redundancy in a precise way, so we can get a factor similar to the ‘100’ in the 
above example. 


When H is infinite, the redundancy factor will also be infinite since the 
hypotheses will be divided among a finite number of dichotomies. Therefore, 
the reduction in the total colored area when we take the redundancy into 
consideration will be dramatic. If it happens that the number of dichotomies 
is only a polynomial, the reduction will be so dramatic as to bring the total 
probability down to a very small value. This is the essence of the proof of 
Theorem 2.5. 


The reason $m_{\mathcal{H}}(2N)$ appears in the VC bound instead of $m_{\mathcal{H}}(N)$ is that
the proof uses a sample of 2N points instead of N points. Why do we need 2N
points? The event “$|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon$” depends not only on D, but also on
the entire $\mathcal{X}$ because $E_{\mathrm{out}}(h)$ is based on $\mathcal{X}$. This breaks the main premise of
grouping h’s based on their behavior on D, since aspects of each h outside of D
affect the truth of “$|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon$.” To remedy that, we consider the
artificial event “$|E_{\mathrm{in}}(h) - E'_{\mathrm{in}}(h)| > \epsilon$” instead, where $E_{\mathrm{in}}$ and $E'_{\mathrm{in}}$ are based
on two samples D and D′ each of size N. This is where the 2N comes from.
It accounts for the total size of the two samples D and D′. Now, the truth of
the statement “$|E_{\mathrm{in}}(h) - E'_{\mathrm{in}}(h)| > \epsilon$” depends exclusively on the total sample
of size 2N, and the above redundancy argument will hold.


Of course we have to justify why the two-sample condition
“$|E_{\mathrm{in}}(h) - E'_{\mathrm{in}}(h)| > \epsilon$” can replace the original condition
“$|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \epsilon$.” In doing so, we end up having to shrink the
$\epsilon$’s by a factor of 4, and also end up with a factor of 2 in the estimate of the
overall probability. This accounts for the $\frac{8}{N}$ instead of $\frac{1}{2N}$ in the VC bound
and for having 4 instead of 2 as the multiplicative factor of the growth function.
When you put all this together, you get the formula in (2.12). □


2.2 Interpreting the Generalization Bound 


The VC generalization bound (2.12) is a universal result in the sense that 
it applies to all hypothesis sets, learning algorithms, input spaces, probability 
distributions, and binary target functions. It can be extended to other types of 
target functions as well. Given the generality of the result, one would suspect 
that the bound it provides may not be particularly tight in any given case, 
since the same bound has to cover a lot of different cases. Indeed, the bound 
is quite loose. 
Exercise 2.5

Suppose we have a simple learning model whose growth function is
$m_{\mathcal{H}}(N) = N + 1$, hence $d_{\mathrm{vc}} = 1$. Use the VC bound (2.12) to estimate
the probability that $E_{\mathrm{out}}$ will be within 0.1 of $E_{\mathrm{in}}$ given 100 training
examples. [Hint: The estimate will be ridiculous.]


Why is the VC bound so loose? The slack in the bound can be attributed to 
a number of technical factors. Among them, 


1. The basic Hoeffding Inequality used in the proof already has a slack.
The inequality gives the same bound whether $E_{\mathrm{out}}$ is close to 0.5 or
close to zero. However, the variance of $E_{\mathrm{in}}$ is quite different in these
two cases. Therefore, having one bound capture both cases will result
in some slack.


2. Using $m_{\mathcal{H}}(N)$ to quantify the number of dichotomies on N points,
regardless of which N points are in the data set, gives us a worst-case
estimate. This does allow the bound to be independent of the probability
distribution P over $\mathcal{X}$. However, we would get a more tuned
bound if we considered specific $\mathbf{x}_1,\dots,\mathbf{x}_N$ and used $|\mathcal{H}(\mathbf{x}_1,\dots,\mathbf{x}_N)|$ or
its expected value instead of the upper bound $m_{\mathcal{H}}(N)$. For instance, in
the case of convex sets in two dimensions, which we examined in Example 2.2,
if you pick N points at random in the plane, they will likely have
far fewer dichotomies than $2^N$, while $m_{\mathcal{H}}(N) = 2^N$.


3. Bounding $m_{\mathcal{H}}(N)$ by a simple polynomial of order $d_{\mathrm{vc}}$, as given in (2.10),
will contribute further slack to the VC bound.


Some effort could be put into tightening the VC bound, but many highly 
technical attempts in the literature have resulted in only diminishing returns. 
The reality is that the VC line of analysis leads to a very loose bound. Why 
did we bother to go through the analysis then? Two reasons. First, the VC 
analysis is what establishes the feasibility of learning for infinite hypothesis 
sets, the only kind we use in practice. Second, although the bound is loose, 
it tends to be equally loose for different learning models, and hence is useful 
for comparing the generalization performance of these models. This is an 
observation from practical experience, not a mathematical statement. In real
applications, learning models with lower $d_{\mathrm{vc}}$ tend to generalize better than
those with higher $d_{\mathrm{vc}}$. Because of this observation, the VC analysis proves
useful in practice, and some rules of thumb have emerged in terms of the VC
dimension. For instance, requiring that N be at least $10 \times d_{\mathrm{vc}}$ to get decent
generalization is a popular rule of thumb.

Thus, the VC bound can be used as a guideline for generalization, relatively 
if not absolutely. With this understanding, let us look at the different ways 
the bound is used in practice. 
2.2.1 Sample Complexity


The sample complexity denotes how many training examples N are needed
to achieve a certain generalization performance. The performance is specified
by two parameters, $\epsilon$ and $\delta$. The error tolerance $\epsilon$ determines the allowed
generalization error, and the confidence parameter $\delta$ determines how often the
error tolerance $\epsilon$ is violated. How fast N grows as $\epsilon$ and $\delta$ become smaller
indicates how much data is needed to get good generalization.

We can use the VC bound to estimate the sample complexity for a given
learning model. Fix $\delta > 0$, and suppose we want the generalization error to
be at most $\epsilon$. From Equation (2.12), the generalization error is bounded by
$\sqrt{\frac{8}{N} \ln \frac{4 m_{\mathcal{H}}(2N)}{\delta}}$, and so it suffices to make
$\sqrt{\frac{8}{N} \ln \frac{4 m_{\mathcal{H}}(2N)}{\delta}} \le \epsilon$. It follows that


$$N \ge \frac{8}{\epsilon^2} \ln \left( \frac{4 m_{\mathcal{H}}(2N)}{\delta} \right)$$


suffices to obtain generalization error at most $\epsilon$ (with probability at least $1 - \delta$).
This gives an implicit bound for the sample complexity N, since N appears on
both sides of the inequality. If we replace $m_{\mathcal{H}}(2N)$ in (2.12) by its polynomial
upper bound in (2.10), which is based on the VC dimension, we get a
similar bound


$$N \ge \frac{8}{\epsilon^2} \ln \left( \frac{4\left((2N)^{d_{\mathrm{vc}}} + 1\right)}{\delta} \right), \tag{2.13}$$


which is again implicit in N. We can obtain a numerical value for N using
simple iterative methods.


Example 2.6. Suppose that we have a learning model with $d_{\mathrm{vc}} = 3$ and
would like the generalization error to be at most 0.1 with confidence 90% (so
$\epsilon = 0.1$ and $\delta = 0.1$). How big a data set do we need? Using (2.13), we need


$$N \ge \frac{8}{0.1^2} \ln \left( \frac{4(2N)^3 + 4}{0.1} \right).$$


Trying an initial guess of N = 1,000 in the RHS, we get


$$N \ge \frac{8}{0.1^2} \ln \left( \frac{4(2 \times 1000)^3 + 4}{0.1} \right) \approx 21{,}193.$$

We then try the new value N = 21,193 in the RHS and continue this iterative
process, rapidly converging to an estimate of N ≈ 30,000. If $d_{\mathrm{vc}}$ were 4, a
similar calculation will find that N ≈ 40,000. For $d_{\mathrm{vc}} = 5$, we get N ≈ 50,000.
You can see that the inequality suggests that the number of examples needed
is approximately proportional to the VC dimension, as has been observed in
practice. The constant of proportionality it suggests is 10,000, which is a gross
overestimate; a more practical constant of proportionality is closer to 10. □
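The iteration described in Example 2.6 is easy to sketch in code: evaluate the right-hand side of (2.13) at the current guess and feed the result back in until it stops changing. The function names, the stopping tolerance, and the iteration cap below are assumptions chosen for illustration.

```python
import math

def sample_complexity_rhs(n, d_vc, eps, delta):
    """Right-hand side of (2.13) evaluated at the current guess n."""
    return 8 / eps**2 * math.log(4 * ((2 * n) ** d_vc + 1) / delta)

def iterate_sample_complexity(d_vc, eps=0.1, delta=0.1, n0=1000.0,
                              tol=1e-6, max_iter=1000):
    """Fixed-point iteration N <- RHS(N), started from the guess n0."""
    n = n0
    for _ in range(max_iter):
        n_next = sample_complexity_rhs(n, d_vc, eps, delta)
        if abs(n_next - n) < tol:
            break
        n = n_next
    return n_next

# First iterate from N = 1,000 with d_vc = 3, as in the example.
print(round(sample_complexity_rhs(1000, 3, 0.1, 0.1)))

# Converged estimates for d_vc = 3, 4, 5.
for d in (3, 4, 5):
    print(d, math.ceil(iterate_sample_complexity(d)))
```

The converged values land slightly below the rounded figures quoted in the example (roughly 29,300 for $d_{\mathrm{vc}} = 3$), which is consistent with the text reporting them to the nearest ten thousand.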


⁴The term ‘complexity’ comes from a similar metaphor in computational complexity.
2.2.2 Penalty for Model Complexity


Sample complexity fixes the performance parameters $\epsilon$ (generalization error)
and $\delta$ (confidence parameter) and estimates how many examples N are needed.
In most practical situations, however, we are given a fixed data set D, so N
is also fixed. In this case, the relevant question is what performance can we
expect given this particular N. The bound in (2.12) answers this question:
with probability at least $1 - \delta$,


$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N} \ln \left( \frac{4 m_{\mathcal{H}}(2N)}{\delta} \right)}.$$


If we use the polynomial bound based on $d_{\mathrm{vc}}$ instead of $m_{\mathcal{H}}(2N)$, we get
another valid bound on the out-of-sample error,


$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N} \ln \left( \frac{4\left((2N)^{d_{\mathrm{vc}}} + 1\right)}{\delta} \right)}. \tag{2.14}$$


Example 2.7. Suppose that N = 100 and we have a 90% confidence requirement
(δ = 0.1). We could ask what error bar can we offer with this confidence,
if H has dvc = 1. Using (2.14), we have

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{100} \ln\left(\frac{4(201)}{0.1}\right)} \approx E_{\mathrm{in}}(g) + 0.848 \tag{2.15}$$

with confidence ≥ 90%. This is a pretty poor bound on Eout. Even if Ein = 0,
Eout may still be close to 1. If N = 1,000, then we get Eout(g) ≤ Ein(g) + 0.301,
a somewhat more respectable bound. □
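
The error bars in Example 2.7 follow directly from the penalty term in (2.14). The snippet below is a small sketch of that arithmetic; the function name `vc_error_bar` is an assumption made for illustration.

```python
import math


def vc_error_bar(n, d_vc, delta):
    """Penalty term of (2.14): sqrt((8/N) ln(4((2N)^d_vc + 1)/delta))."""
    growth_bound = (2.0 * n) ** d_vc + 1.0
    return math.sqrt(8.0 / n * math.log(4.0 * growth_bound / delta))


if __name__ == "__main__":
    # The two sample sizes considered in Example 2.7 (d_vc = 1, delta = 0.1).
    for n in (100, 1000):
        print(n, round(vc_error_bar(n, d_vc=1, delta=0.1), 3))
```

Evaluating the same function for other values of N, dvc, or δ gives a quick feel for how slowly the bound tightens.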


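Let us look more closely at the two parts that make up the bound on Eout
in (2.12). The first part is Ein, and the second part is a term that increases
as the VC dimension of H increases.

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \Omega(N, \mathcal{H}, \delta), \tag{2.16}$$

where

$$\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}} \le \sqrt{\frac{8}{N} \ln\left(\frac{4\left((2N)^{d_{\mathrm{vc}}} + 1\right)}{\delta}\right)}.$$
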
One way to think of Ω(N, H, δ) is that it is a penalty for model complexity. It
penalizes us by worsening the bound on Eout when we use a more complex H
(larger dvc). If someone manages to fit a simpler model with the same training
error, they will get a more favorable estimate for Eout. The penalty Ω(N, H, δ)
gets worse if we insist on higher confidence (lower δ), and it gets better when
we have more training examples, as we would expect.

Figure 2.3: When we use a more complex learning model, one that has
higher VC dimension dvc, we are likely to fit the training data better,
resulting in a lower in-sample error, but we pay a higher penalty for model
complexity. A combination of the two, which estimates the out-of-sample
error, thus attains a minimum at some intermediate d*vc.

Although Ω(N, H, δ) goes up when H has a higher VC dimension, Ein is
likely to go down with a higher VC dimension as we have more choices within H
to fit the data. Therefore, we have a tradeoff: more complex models help Ein
and hurt Ω(N, H, δ). The optimal model is a compromise that minimizes a
combination of the two terms, as illustrated informally in Figure 2.3.


2.2.3 The Test Set 


As we have seen, the generalization bound gives us a loose estimate of the
out-of-sample error Eout based on Ein. While the estimate can be useful as
a guideline for the training process, it is next to useless if the goal is to get 
an accurate forecast of Eout. If you are developing a system for a customer, 
you need a more accurate estimate so that your customer knows how well the 
system is expected to perform. 

An alternative approach that we alluded to in the beginning of this chapter
is to estimate Eout by using a test set, a data set that was not involved in the
training process. The final hypothesis g is evaluated on the test set, and the 
result is taken as an estimate of Eout. We would like to now take a closer look 
at this approach. 

Let us call the error we get on the test set Etest. When we report Etest as
our estimate of Eout, we are in fact asserting that Etest generalizes very well
to Eout. After all, Etest is just a sample estimate like Ein. How do we know
that Etest generalizes well? We can answer this question with authority now
that we have developed the theory of generalization in concrete mathematical
terms.

The effective number of hypotheses that matters in the generalization behavior
of Etest is 1. There is only one hypothesis as far as the test set is
concerned, and that’s the final hypothesis g that the training phase produced. 
This hypothesis would not change if we used a different test set as it would if 
we used a different training set. Therefore, the simple Hoeffding Inequality is 
valid in the case of a test set. Had the choice of g been affected by the test 
set in any shape or form, it wouldn’t be considered a test set any more and 
the simple Hoeffding Inequality would not apply. 

Therefore, the generalization bound that applies to Etest is the simple
Hoeffding Inequality with one hypothesis. This is a much tighter bound than
the VC bound. For example, if you have 1,000 data points in the test set, Etest
will be within ±5% of Eout with probability ≥ 98%. The bigger the test set
you use, the more accurate Etest will be as an estimate of Eout.
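
The 1,000-point figure quoted above can be checked against the Hoeffding Inequality for a single hypothesis, P[|Etest − Eout| > ε] ≤ 2 exp(−2ε²N). The following is a minimal sketch of that check; the function names are assumptions made for illustration.

```python
import math


def hoeffding_failure_prob(n_test, eps):
    """Upper bound on P[|Etest - Eout| > eps] for one fixed hypothesis."""
    return 2.0 * math.exp(-2.0 * eps**2 * n_test)


def hoeffding_error_bar(n_test, delta):
    """Smallest eps for which the bound above is at most delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))


if __name__ == "__main__":
    # Confidence that Etest is within 0.05 of Eout with 1,000 test points.
    print(1.0 - hoeffding_failure_prob(1000, 0.05))
    # Error bar at 98% confidence for the same test set size.
    print(hoeffding_error_bar(1000, 0.02))
```

Because the bound involves only one hypothesis, there is no growth function or VC dimension in either formula, which is why the resulting error bar is so much smaller than the one in (2.14).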


Exercise 2.6 

A data set has 600 examples. To properly test the performance of the
final hypothesis, you set aside a randomly selected subset of 200 examples
which are never used in the training phase; these form a test set. You use
a learning model with 1,000 hypotheses and select the final hypothesis g
based on the 400 training examples. We wish to estimate Eout(g). We have
access to two estimates: Ein(g), the in-sample error on the 400 training
examples; and Etest(g), the test error on the 200 test examples that were
set aside.


(a) Using a 5% error tolerance (δ = 0.05), which estimate has the higher
‘error bar’?


(b) Is there any reason why you shouldn't reserve even more examples for 
testing? 


Another aspect that distinguishes the test set from the training set is that the 
test set is not biased. Both sets are finite samples that are bound to have 
some variance due to sample size, but the test set doesn’t have an optimistic 
or pessimistic bias in its estimate of Eout. The training set has an optimistic 
bias, since it was used to choose a hypothesis that looked good on it. The VC 
generalization bound implicitly takes that bias into consideration, and that’s 
why it gives a huge error bar. The test set just has straight finite-sample 
variance, but no bias. When you report the value of Etest to your customer
and they try your system on new data, they are as likely to be pleasantly 
surprised as unpleasantly surprised, though quite likely not to be surprised at 
all. 

There is a price to be paid for having a test set. The test set does not 
affect the outcome of our learning process, which only uses the training set. 
The test set just tells us how well we did. Therefore, if we set aside some
of the data points provided by the customer as a test set, we end up using
fewer examples for training. Since the training set is used to select one of the
hypotheses in H, training examples are essential to finding a good hypothesis.
If we take a big chunk of the data for testing and end up with too few examples
for training, we may not get a good hypothesis from the training part even if
we can reliably evaluate it in the testing part. We may end up reporting to
the customer, with high confidence mind you, that the g we are delivering is
terrible ☹. There is thus a tradeoff to setting aside test examples. We will
address that tradeoff in more detail and learn some clever tricks to get around 
it in Chapter 4. 

In some of the learning literature, Etest is used as synonymous with Eout.
When we report experimental results in this book, we will often treat Etest
based on a large test set as if it were Eout because of the closeness of the two
quantities.


2.2.4 Other Target Types 


Although the VC analysis was based on binary target functions, it can be 
extended to real-valued functions, as well as to other types of functions. The 
proofs in those cases are quite technical, and they do not add to the insight 
that the VC analysis of binary functions provides. Therefore, we will introduce 
an alternative approach that covers real-valued functions and provides new 
insights into generalization. The approach is based on bias-variance analysis, 
and will be discussed in the next section. 

In order to deal with real-valued functions, we need to adapt the definitions 
of Ein and Eout that have so far been based on binary functions. We defined Ein
and Eout in terms of binary error; either h(x) = f(x) or else h(x) ≠ f(x). If f
and h are real-valued, a more appropriate error measure would gauge how far 
f(x) and h(x) are from each other, rather than just whether their values are 
exactly the same. 

An error measure that is commonly used in this case is the squared error
e(h(x), f(x)) = (h(x) − f(x))². We can define in-sample and out-of-sample
versions of this error measure. The out-of-sample error is based on the
expected value of the error measure over the entire input space X,

$$E_{\mathrm{out}}(h) = \mathbb{E}\left[\left(h(\mathbf{x}) - f(\mathbf{x})\right)^2\right],$$

while the in-sample error is based on averaging the error measure over the
data set,

$$E_{\mathrm{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left(h(\mathbf{x}_n) - f(\mathbf{x}_n)\right)^2.$$

These definitions make Ein a sample estimate of Eout just as it was in the case
of binary functions. In fact, the error measure used for binary functions can 
also be expressed as a squared error.

Exercise 2.7

For binary target functions, show that P[h(x) ≠ f(x)] can be written as an
expected value of a mean-squared error measure in the following cases.

(a) The convention used for the binary function is 0 or 1.

(b) The convention used for the binary function is ±1.

[Hint: The difference between (a) and (b) is just a scale.]


Just as the sample frequency of error converges to the overall probability of 
error per Hoeffding’s Inequality, the sample average of squared error converges 
to the expected value of that error (assuming finite variance). This is a man- 
ifestation of what is referred to as the ‘law of large numbers’ and Hoeffding’s 
Inequality is just one form of that law. The same issues of the data set size 
and the hypothesis set complexity come into play just as they did in the VC 
analysis. 
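
The convergence of the sample average of squared error to its expected value can be observed numerically. The sketch below is only an illustration under assumptions that are ours, not the text's: the target sin(πx), the fixed hypothesis h(x) = x, the uniform input distribution on [−1, 1], and all function names are chosen for convenience because the expected squared error then has a simple closed form.

```python
import math
import random


def target(x):
    """Illustrative real-valued target function (an assumption)."""
    return math.sin(math.pi * x)


def hypothesis(x):
    """An arbitrary fixed hypothesis, chosen before seeing any data."""
    return x


def sample_squared_error(n, rng):
    """Average of (h(x) - f(x))^2 over n inputs drawn uniformly from [-1, 1]."""
    total = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        total += (hypothesis(x) - target(x)) ** 2
    return total / n


# Closed form of E[(x - sin(pi x))^2] for x uniform on [-1, 1]:
# E[x^2] + E[sin^2(pi x)] - 2 E[x sin(pi x)] = 1/3 + 1/2 - 2/pi.
EXPECTED = 1.0 / 3.0 + 0.5 - 2.0 / math.pi

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 1000, 100000):
        print(n, sample_squared_error(n, rng), EXPECTED)
```

As N grows, the printed sample average should settle near the closed-form value, which is the behavior the law of large numbers describes for a single fixed hypothesis.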


2.3 Approximation-Generalization Tradeoff 


The VC analysis showed us that the choice of H needs to strike a balance 
between approximating f on the training data and generalizing on new data. 
The ideal H is a singleton hypothesis set containing only the target function. 
Unfortunately, we are better off buying a lottery ticket than hoping to have 
this H. Since we do not know the target function, we resort to a larger model 
hoping that it will contain a good hypothesis, and hoping that the data will 
pin down that hypothesis. When you select your hypothesis set, you should 
balance these two conflicting goals; to have some hypothesis in H that can 
approximate f, and to enable the data to zoom in on the right hypothesis. 

The VC generalization bound is one way to look at this tradeoff. If H is 
too simple, we may fail to approximate f well and end up with a large in- 
sample error term. If H is too complex, we may fail to generalize well because 
of the large model complexity term. There is another way to look at the 
approximation-generalization tradeoff which we will present in this section. It 
is particularly suited for squared error measures, rather than the binary error 
used in the VC analysis. The new way provides a different angle; instead of 
bounding Eout by Ein plus a penalty term Ω, we will decompose Eout into two
different error terms. 


2.3.1 Bias and Variance 


The bias-variance decomposition of out-of-sample error is based on squared 
error measures. The out-of-sample error is 


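$$E_{\mathrm{out}}(g^{(\mathcal{D})}) = \mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right], \tag{2.17}$$

where E_x denotes the expected value with respect to x (based on the probability
distribution on the input space X). We have made explicit the dependence
of the final hypothesis g on the data D, as this will play a key role in the
current analysis. We can rid Equation (2.17) of the dependence on a particular
data set by taking the expectation with respect to all data sets. We then get
the expected out-of-sample error for our learning model, independent of any
particular realization of the data set,

$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}\left[E_{\mathrm{out}}(g^{(\mathcal{D})})\right]
&= \mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right] \\
&= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right] \\
&= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - 2\,\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right] f(\mathbf{x}) + f(\mathbf{x})^2\right].
\end{aligned}$$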


The term E_D[g^(D)(x)] gives an ‘average function’, which we denote by ḡ(x).
One can interpret ḡ(x) in the following operational way. Generate many data
sets D1, ..., DK and apply the learning algorithm to each data set to produce
final hypotheses g1, ..., gK. We can then estimate the average function for
any x by ḡ(x) ≈ (1/K) Σ_{k=1}^{K} g_k(x). Essentially, we are viewing g^(D)(x) as a random
variable, with the randomness coming from the randomness in the data set;
ḡ(x) is the expected value of this random variable (for a particular x), and ḡ
is a function, the average function, composed of these expected values. The
function ḡ is a little counterintuitive; for one thing, ḡ need not be in the
model’s hypothesis set, even though it is the average of functions that are.


Exercise 2.8 


(a) Show that if H is closed under linear combination (any linear
combination of hypotheses in H is also a hypothesis in H), then ḡ ∈ H.


(b) Give a model for which the average function ḡ is not in the model's
hypothesis set. [Hint: Use a very simple model.]


(c) For binary classification, do you expect ḡ to be a binary function?


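We can now rewrite the expected out-of-sample error in terms of ḡ:

$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}\left[E_{\mathrm{out}}(g^{(\mathcal{D})})\right]
&= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - 2\,\bar{g}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2\right] \\
&= \mathbb{E}_{\mathbf{x}}\Big[\underbrace{\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - \bar{g}(\mathbf{x})^2}_{\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right]} + \underbrace{\bar{g}(\mathbf{x})^2 - 2\,\bar{g}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2}_{\left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2}\Big],
\end{aligned}$$

where the last reduction follows since ḡ(x) is constant with respect to D.
The term (ḡ(x) − f(x))² measures how much the average function that we
would learn using different data sets D deviates from the target function that
generated these data sets. This term is appropriately called the bias: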


$$\mathrm{bias}(\mathbf{x}) = \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2,$$

as it measures how much our learning model is biased away from the target
function.⁵ This is because ḡ has the benefit of learning from an unlimited
number of data sets, so it is only limited in its ability to approximate f by
the limitation in the learning model itself. The term E_D[(g^(D)(x) − ḡ(x))²]
is the variance of the random variable g^(D)(x),

$$\mathrm{var}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right],$$


which measures the variation in the final hypothesis, depending on the data 
set. We thus arrive at the bias-variance decomposition of out-of-sample error, 


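$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}\left[E_{\mathrm{out}}(g^{(\mathcal{D})})\right] &= \mathbb{E}_{\mathbf{x}}\left[\mathrm{bias}(\mathbf{x}) + \mathrm{var}(\mathbf{x})\right] \\
&= \mathrm{bias} + \mathrm{var},
\end{aligned}$$

where bias = E_x[bias(x)] and var = E_x[var(x)]. Our derivation assumed that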
the data was noiseless. A similar derivation with noise in the data would lead 
to an additional noise term in the out-of-sample error (Problem 2.22). The 
noise term is unavoidable no matter what we do, so the terms we are interested 
in are really the bias and var. 

The approximation-generalization tradeoff is captured in the bias-variance 
decomposition. To illustrate, let’s consider two extreme cases: a very small 
model (with one hypothesis) and a very large one with all hypotheses. 


Very small model. Since there is only one hypothesis, both the average
function ḡ and the final hypothesis g^(D) will be the same, for any data set.
Thus, var = 0. The bias will depend solely on how well this single hypothesis
approximates the target f, and unless we are extremely lucky, we expect a
large bias.

Very large model. The target function is in H. Different data sets will lead
to different hypotheses that agree with f on the data set, and are spread
around f in the red region. Thus, bias ≈ 0 because ḡ is likely to be close
to f. The var is large (heuristically represented by the size of the red region
in the figure).


One can also view the variance as a measure of ‘instability’ in the learning 
model. Instability manifests in wild reactions to small variations or idiosyn- 
crasies in the data, resulting in vastly different hypotheses. 





5What we call bias is sometimes called bias² in the literature.




Example 2.8. Consider a target function f(x) = sin(πx) and a data set
of size N = 2. We sample x uniformly in [−1, 1] to generate a data set
(x1, y1), (x2, y2); and fit the data using one of two models:

H0: Set of all lines of the form h(x) = b;
H1: Set of all lines of the form h(x) = ax + b.

For H0, we choose the constant hypothesis that best fits the data (the
horizontal line at the midpoint, b = (y1 + y2)/2). For H1, we choose the line
that passes through the two data points (x1, y1) and (x2, y2). Repeating this
process with many data sets, we can estimate the bias and the variance. The
figures which follow show the resulting fits on the same (random) data sets
for both models.

With H1, the learned hypothesis is wilder and varies extensively depending
on the data set. The bias-var analysis is summarized in the next figures.

H0: bias = 0.50; var = 0.25.
H1: bias = 0.21; var = 1.69.

Average hypothesis ḡ (red) with var(x) indicated by the gray shaded
region that is ḡ(x) ± √var(x).


For H1, the average hypothesis ḡ (red line) is a reasonable fit with a fairly
small bias of 0.21. However, the large variability leads to a high var of 1.69
resulting in a large expected out-of-sample error of 1.90. With the simpler
model H0, the fits are much less volatile and we have a significantly lower var
of 0.25, as indicated by the shaded region. However, the average fit is now
the zero function, resulting in a higher bias of 0.50. The total out-of-sample
error has a much smaller expected value of 0.75. The simpler model wins by
significantly decreasing the var at the expense of a smaller increase in bias.
Notice that we are not comparing how well the red curves (the average
hypotheses) fit the sine. These curves are only conceptual, since in real learning
we do not have access to the multitude of data sets needed to generate them.
We have one data set, and the simpler model results in a better out-of-sample
error on average as we fit our model to just this one data set. However, the var
term decreases as N increases, so if we get a bigger and bigger data set, the
bias term will be the dominant part of Eout, and H1 will win. □
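
The estimates quoted in Example 2.8 can be reproduced approximately by simulation. The sketch below is one way to do it, under assumptions that are ours rather than the text's: the function names, the number of simulated data sets, and the random seed are illustrative, and the averages over x are taken in closed form (valid for hypotheses of the form h(x) = ax + b with x uniform on [−1, 1]) so that only the average over data sets is simulated.

```python
import math
import random


def fit_constant(x1, y1, x2, y2):
    """H0: horizontal line at the midpoint; returned as (slope, intercept)."""
    return 0.0, (y1 + y2) / 2.0


def fit_line(x1, y1, x2, y2):
    """H1: line through the two data points; returned as (slope, intercept)."""
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1


def bias_variance(fit, n_datasets=200_000, seed=0):
    """Monte Carlo estimate of (bias, var) for f(x) = sin(pi x) and N = 2.

    For h(x) = a x + b and x uniform on [-1, 1], the averages over x use
    E[x] = 0, E[x^2] = 1/3, E[sin^2(pi x)] = 1/2 and E[x sin(pi x)] = 1/pi.
    """
    rng = random.Random(seed)
    sum_a = sum_b = sum_a2 = sum_b2 = 0.0
    count = 0
    while count < n_datasets:
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x1 == x2:  # degenerate data set; skip it (probability zero in theory)
            continue
        a, b = fit(x1, math.sin(math.pi * x1), x2, math.sin(math.pi * x2))
        sum_a += a
        sum_b += b
        sum_a2 += a * a
        sum_b2 += b * b
        count += 1
    a_bar, b_bar = sum_a / count, sum_b / count  # coefficients of the average function
    # bias = E_x[(a_bar x + b_bar - sin(pi x))^2]
    bias = a_bar**2 / 3.0 + b_bar**2 + 0.5 - 2.0 * a_bar / math.pi
    # var = E_x[E_D[((a - a_bar) x + (b - b_bar))^2]] = Var(a)/3 + Var(b)
    var = (sum_a2 / count - a_bar**2) / 3.0 + (sum_b2 / count - b_bar**2)
    return bias, var


if __name__ == "__main__":
    for name, fit in (("H0", fit_constant), ("H1", fit_line)):
        bias, var = bias_variance(fit)
        print(name, round(bias, 2), round(var, 2), round(bias + var, 2))
```

With a large number of simulated data sets, the estimates should land close to the values quoted in the example, up to Monte Carlo error.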


The learning algorithm plays a role in the bias-variance analysis that it did 
not play in the VC analysis. Two points are worth noting. 


1. By design, the VC analysis is based purely on the hypothesis set H,
independently of the learning algorithm A. In the bias-variance analysis,
both H and the algorithm A matter. With the same H, using a different
learning algorithm can produce a different g^(D). Since g^(D) is the
building block of the bias-variance analysis, this may result in different
bias and var terms.


2. Although the bias-variance analysis is based on squared-error measure,
the learning algorithm itself does not have to be based on minimizing
the squared error. It can use any criterion to produce g^(D) based on D.
However, once the algorithm produces g^(D), we measure its bias and
variance using squared error.


Unfortunately, the bias and variance cannot be computed in practice, since 
they depend on the target function and the input probability distribution (both 
unknown). Thus, the bias-variance decomposition is a conceptual tool which 
is helpful when it comes to developing a model. There are two typical goals 
when we consider bias and variance. The first is to try to lower the variance 
without significantly increasing the bias, and the second is to lower the bias 
without significantly increasing the variance. These goals are achieved by 
different techniques, some principled and some heuristic. Regularization is 
one of these techniques that we will discuss in Chapter 4. Reducing the bias 
without increasing the variance requires some prior information regarding the 
target function to steer the selection of H in the direction of f, and this task is 
largely application-specific. On the other hand, reducing the variance without 
compromising the bias can be done through general techniques. 


2.3.2 The Learning Curve 


We close this chapter with an important plot that illustrates the tradeoffs 
that we have seen so far. The learning curves summarize the behavior of the
in-sample and out-of-sample errors as we vary the size of the training set.

After learning with a particular data set D of size N, the final hypothesis
g^(D) has in-sample error Ein(g^(D)) and out-of-sample error Eout(g^(D)), both
of which depend on D. As we saw in the bias-variance analysis, the expectation
with respect to all data sets of size N gives the expected errors: E_D[Ein(g^(D))]
and E_D[Eout(g^(D))]. These expected errors are functions of N, and are called
the learning curves of the model. We illustrate the learning curves for a simple
learning model and a complex one, based on actual experiments.

[Figure: learning curves showing Expected Error (Eout and Ein) against Number of Data Points, N. Left panel: Simple Model. Right panel: Complex Model.]


Notice that for the simple model, the learning curves converge more quickly 
but to worse ultimate performance than for the complex model. This behavior 
is typical in practice. For both simple and complex models, the out-of-sample 
learning curve is decreasing in N, while the in-sample learning curve is in- 
creasing in N. Let us take a closer look at these curves and interpret them in 
terms of the different approaches to generalization that we have discussed. 

In the VC analysis, Eout was expressed as the sum of Ein and a generalization
error that was bounded by Ω, the penalty for model complexity. In the
bias-variance analysis, Eout was expressed as the sum of a bias and a variance.
The following learning curves illustrate these two approaches side by side.

[Figure: two learning-curve panels plotting Expected Error against Number of Data Points, N. Left panel, VC Analysis: Eout split into in-sample error and generalization error. Right panel, Bias-Variance Analysis: Eout split into bias and variance.]

The VC analysis bounds the generalization error which is illustrated on the
left. The bias-variance analysis is illustrated on the right. The bias-variance
illustration is somewhat idealized, since it assumes that, for every N, the
average learned hypothesis ḡ has the same performance as the best approximation
to f in the learning model.

When the number of data points increases, we move to the right on the
learning curves and both the generalization error and the variance term shrink,
as expected. The learning curve also illustrates an important point about Ein.
As N increases, Ein edges toward the smallest error that the learning model
can achieve in approximating f. For small N, the value of Ein is actually
smaller than that ‘smallest possible’ error. This is because the learning model
has an easier task for smaller N; it only needs to approximate f on the N
points regardless of what happens outside those points. Therefore, it can
achieve a superior fit on those points, albeit at the expense of an inferior fit
on the rest of the points as shown by the corresponding value of Eout.
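
The qualitative behavior of the learning curves can be checked with a small simulation. The sketch below is an illustration under assumptions of our own choosing, not the experiment behind the figures: it reuses the noiseless target sin(πx) with x uniform on [−1, 1], fits a line by least squares, and uses hypothetical function names, data set counts, and seed.

```python
import math
import random


def fit_least_squares_line(xs, ys):
    """Least-squares fit of h(x) = a x + b; returns (a, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:  # all inputs identical: fall back to the constant fit
        return 0.0, y_mean
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_mean - a * x_mean


def expected_errors(n, n_datasets=2000, seed=0):
    """Monte Carlo estimate of (E_D[Ein], E_D[Eout]) for data sets of size n."""
    rng = random.Random(seed)
    ein_total = eout_total = 0.0
    for _ in range(n_datasets):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ys = [math.sin(math.pi * x) for x in xs]
        a, b = fit_least_squares_line(xs, ys)
        ein_total += sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
        # Closed form of E_x[(a x + b - sin(pi x))^2] for x uniform on [-1, 1].
        eout_total += a * a / 3.0 + b * b + 0.5 - 2.0 * a / math.pi
    return ein_total / n_datasets, eout_total / n_datasets


# Error of the best line in the model, (3/pi) x: the smallest achievable error.
BEST_LINE_ERROR = 0.5 - 3.0 / math.pi**2

if __name__ == "__main__":
    for n in (2, 5, 20, 100):
        ein, eout = expected_errors(n)
        print(n, round(ein, 3), round(eout, 3), round(BEST_LINE_ERROR, 3))
```

In this setup the expected Ein starts at zero for N = 2 and rises toward the smallest achievable error from below, while the expected Eout falls toward the same value from above.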


6For the learning curve, we take the expected values of all quantities with respect to D
of size N.


2.4 Problems


Problem 2.1 In Equation (2.1), set δ = 0.03 and let

$$\epsilon(M, N, \delta) = \sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}.$$

(a) For M = 1, how many examples do we need to make ε ≤ 0.05?
(b) For M = 100, how many examples do we need to make ε ≤ 0.05?
(c) For M = 10,000, how many examples do we need to make ε ≤ 0.05?


Problem 2.2 Show that for the learning model of positive rectangles
(aligned horizontally or vertically), m_H(4) = 2^4 and m_H(5) < 2^5. Hence, give
a bound for m_H(N).


Problem 2.3 Compute the maximum number of dichotomies, m_H(N),
for these learning models, and consequently compute dvc, the VC dimension.


(a) Positive or negative ray: H contains the functions which are +1 on [a, ∞)
(for some a) together with those that are +1 on (−∞, a] (for some a).

(b) Positive or negative interval: H contains the functions which are +1 on
an interval [a, b] and −1 elsewhere or −1 on an interval [a, b] and +1
elsewhere.


(c) Two concentric spheres in R^d: H contains the functions which are +1 for

$$a \le \sqrt{x_1^2 + \cdots + x_d^2} \le b.$$


Problem 2.4 Show that $B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i}$ by showing the other
direction to Lemma 2.3, namely that

$$B(N, k) \ge \sum_{i=0}^{k-1} \binom{N}{i}.$$

To do so, construct a specific set of $\sum_{i=0}^{k-1} \binom{N}{i}$ dichotomies that does not
shatter any subset of k variables. [Hint: Try limiting the number of −1's in
each dichotomy.]


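Problem 2.5 Prove by induction that $\sum_{i=0}^{D} \binom{N}{i} \le N^{D} + 1$, hence

$$m_{\mathcal{H}}(N) \le N^{d_{\mathrm{vc}}} + 1.$$


Problem 2.6 Prove that for N ≥ d,

$$\sum_{i=0}^{d} \binom{N}{i} \le \left(\frac{eN}{d}\right)^{d}.$$

We suggest you first show the following intermediate steps.

(a) $\sum_{i=0}^{d} \binom{N}{i} \le \sum_{i=0}^{d} \binom{N}{i} \left(\frac{N}{d}\right)^{d-i} \le \left(\frac{N}{d}\right)^{d} \sum_{i=0}^{N} \binom{N}{i} \left(\frac{d}{N}\right)^{i}$.

(b) $\sum_{i=0}^{N} \binom{N}{i} \left(\frac{d}{N}\right)^{i} \le e^{d}$. [Hints: Binomial theorem; $\left(1 + \frac{1}{x}\right)^{x} \le e$ for x > 0.]

Hence, argue that $m_{\mathcal{H}}(N) \le \left(\frac{eN}{d_{\mathrm{vc}}}\right)^{d_{\mathrm{vc}}}$.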


Problem 2.7 Plot the bounds for m_H(N) given in Problems 2.5 and 2.6
for dvc = 2 and dvc = 5. When do you prefer one bound over the other?


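Problem 2.8 Which of the following are possible growth functions m_H(N)
for some hypothesis set:

$$1 + N; \quad 1 + N + \frac{N(N-1)}{2}; \quad 2^{N}; \quad 2^{\lfloor \sqrt{N} \rfloor}; \quad 2^{\lfloor N/2 \rfloor}; \quad 1 + N + \frac{N(N-1)(N-2)}{6}.$$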


Problem 2.9 [hard] For the perceptron in d dimensions, show that

$$m_{\mathcal{H}}(N) = 2 \sum_{i=0}^{d} \binom{N-1}{i}.$$

[Hint: Cover(1965) in Further Reading.]

Use this formula to verify that dvc = d + 1 by evaluating m_H(d + 1) and
m_H(d + 2). Plot m_H(N)/2^N for d = 10 and N ∈ [1, 40]. If you generate a
random dichotomy on N points in 10 dimensions, give an upper bound on the
probability that the dichotomy will be separable for N = 10, 20, 40.


Problem 2.10 Show that m_H(2N) ≤ m_H(N)², and hence obtain a
generalization bound which only involves m_H(N).


Problem 2.11 Suppose m_H(N) = N + 1, so dvc = 1. You have 100
training examples. Use the generalization bound to give a bound for Eout with
confidence 90%. Repeat for N = 10,000.




Problem 2.12 For an H with dvc = 10, what sample size do you need
(as prescribed by the generalization bound) to have a 95% confidence that your 
generalization error is at most 0.05? 


Problem 2.13 
(a) Let H = {h1, h2, ..., hM} with some finite M. Prove that dvc(H) ≤
log2 M.


(b) For hypothesis sets H1, H2, ..., HK with finite VC dimensions dvc(Hk),
derive and prove the tightest upper and lower bound that you can get
on $d_{\mathrm{vc}}\left(\bigcap_{k=1}^{K} \mathcal{H}_k\right)$.


(c) For hypothesis sets H1, H2, ..., HK with finite VC dimensions dvc(Hk),
derive and prove the tightest upper and lower bounds that you can get
on $d_{\mathrm{vc}}\left(\bigcup_{k=1}^{K} \mathcal{H}_k\right)$.


Problem 2.14 Let H1, H2, ..., HK be K hypothesis sets with finite VC
dimension dvc. Let H = H1 ∪ H2 ∪ ··· ∪ HK be the union of these models.


(a) Show that dvc(H) < K(dvc + 1).
(b) Suppose that ℓ satisfies 2^ℓ > 2Kℓ^dvc. Show that dvc(H) ≤ ℓ.
(c) Hence, show that

$$d_{\mathrm{vc}}(\mathcal{H}) \le \min\left(K(d_{\mathrm{vc}} + 1),\ 7(d_{\mathrm{vc}} + K) \log_2(d_{\mathrm{vc}} K)\right).$$

That is, dvc(H) = O(max(dvc, K) log2 max(dvc, K)) is not too bad.


Problem 2.15 The monotonically increasing hypothesis set is

$$\mathcal{H} = \{h \mid \mathbf{x}_1 \ge \mathbf{x}_2 \implies h(\mathbf{x}_1) \ge h(\mathbf{x}_2)\},$$

where x1 ≥ x2 if and only if the inequality is satisfied for every component.


(a) Give an example of a monotonic classifier in two dimensions, clearly
showing the +1 and −1 regions.


(b) Compute m_H(N) and hence the VC dimension. [Hint: Consider a set
of N points generated by first choosing one point, and then generating the
next point by increasing the first component and decreasing the second
component until N points are obtained.]


Problem 2.16 In this problem, we will consider X = R. That is, x = x
is a one-dimensional variable. For a hypothesis set

$$\mathcal{H} = \left\{h_{\mathbf{c}} \;\middle|\; h_{\mathbf{c}}(x) = \mathrm{sign}\left(\sum_{i=0}^{D} c_i x^{i}\right)\right\},$$

prove that the VC dimension of H is exactly (D + 1) by showing that

(a) There are (D + 1) points which are shattered by H.
(b) There are no (D + 2) points which are shattered by H.


Problem 2.17 The VC dimension depends on the input space as well
as H. For a fixed H, consider two input spaces X1 ⊂ X2. Show that the VC
dimension of H with respect to input space X1 is at most the VC dimension
of H with respect to input space X2.


How can the result of this problem be used to answer part (b) in Problem 2.16? 
[Hint: How is Problem 2.16 related to a perceptron in D dimensions?] 


Problem 2.18 The VC dimension of the perceptron hypothesis set
corresponds to the number of parameters (w0, w1, ..., wd) of the set, and this
observation is ‘usually’ true for other hypothesis sets. However, we will present
a counter-example here. Prove that the following hypothesis set for x ∈ R has
an infinite VC dimension:

$$\mathcal{H} = \left\{h_{\alpha} \;\middle|\; h_{\alpha}(x) = (-1)^{\lfloor \alpha x \rfloor}, \text{ where } \alpha \in \mathbb{R}\right\},$$

where ⌊A⌋ is the biggest integer ≤ A (the floor function). This hypothesis
has only one parameter α but ‘enjoys’ an infinite VC dimension. [Hint:
Consider x1, ..., xN, where xn = 10^n, and show how to implement an arbitrary
dichotomy y1, ..., yN.]


Problem 2.19 This problem derives a bound for the VC dimension of a
complex hypothesis set that is built from simpler hypothesis sets via
composition. Let H1, ..., HK be hypothesis sets with VC dimension d1, ..., dK. Fix
h1, ..., hK, where hi ∈ Hi. Define a vector z obtained from x to have
components hi(x). Note that x ∈ R^d, but z ∈ {−1, +1}^K. Let H̃ be a hypothesis
set of functions that take inputs in R^K. So

$$\tilde{h} \in \tilde{\mathcal{H}}: \ \mathbf{z} \in \mathbb{R}^{K} \mapsto \{+1, -1\},$$

and suppose that H̃ has VC dimension d̃.

We can apply a hypothesis in H̃ to the z constructed from (h1, ..., hK). This
is the composition of the hypothesis set H̃ with (H1, ..., HK). More formally,
the composed hypothesis set H = H̃ ∘ (H1, ..., HK) is defined by h ∈ H if

$$h(\mathbf{x}) = \tilde{h}(h_1(\mathbf{x}), \ldots, h_K(\mathbf{x})), \qquad \tilde{h} \in \tilde{\mathcal{H}};\ h_i \in \mathcal{H}_i.$$

(a) Show that

$$m_{\mathcal{H}}(N) \le m_{\tilde{\mathcal{H}}}(N) \prod_{i=1}^{K} m_{\mathcal{H}_i}(N). \tag{2.18}$$

[Hint: Fix N points x1, ..., xN and fix h1, ..., hK. This generates N
transformed points z1, ..., zN. These z1, ..., zN can be dichotomized
in at most m_H̃(N) ways, hence for fixed (h1, ..., hK), (x1, ..., xN)
can be dichotomized in at most m_H̃(N) ways. Through the eyes of
x1, ..., xN, at most how many hypotheses are there (effectively) in Hi?
Use this bound to bound the effective number of K-tuples (h1, ..., hK)
that need to be considered. Finally, argue that you can bound the number
of dichotomies that can be implemented by the product of the number
of possible K-tuples (h1, ..., hK) and the number of dichotomies per
K-tuple.]

(b) Use the bound $m(N) \le \left(\frac{eN}{d_{\mathrm{vc}}}\right)^{d_{\mathrm{vc}}}$ to get a bound for m_H(N) in terms of
d̃, d1, ..., dK.

(c) Let $D = \tilde{d} + \sum_{i=1}^{K} d_i$, and assume that D > 2e log2 D. Show that

$$d_{\mathrm{vc}}(\mathcal{H}) \le 2D \log_2 D.$$

(d) If Hi and H̃ are all perceptron hypothesis sets, show that

$$d_{\mathrm{vc}}(\mathcal{H}) = O(dK \log(dK)).$$

In the next chapter, we will further develop the simple linear model. This linear
model is the building block of many other models, such as neural networks.
The results of this problem show how to bound the VC dimension of the more
complex models built in this manner.


Problem 2.20 There are a number of bounds on the generalization
error ε, all holding with probability at least 1 − δ.

(a) Original VC-bound:

$$\epsilon \le \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}.$$

(b) Rademacher Penalty Bound:

$$\epsilon \le \sqrt{\frac{2 \ln\left(2N\, m_{\mathcal{H}}(N)\right)}{N}} + \sqrt{\frac{2}{N} \ln \frac{1}{\delta}} + \frac{1}{N}.$$

(c) Parrondo and Van den Broek:

$$\epsilon \le \sqrt{\frac{1}{N}\left(2\epsilon + \ln \frac{6\, m_{\mathcal{H}}(2N)}{\delta}\right)}.$$

(d) Devroye:

$$\epsilon \le \sqrt{\frac{1}{2N}\left(4\epsilon(1 + \epsilon) + \ln \frac{4\, m_{\mathcal{H}}(N^2)}{\delta}\right)}.$$

Note that (c) and (d) are implicit bounds in ε. Fix dvc = 50 and δ = 0.05 and
plot these bounds as a function of N. Which is best?

Problem 2.21 Assume the following theorem to hold

Theorem

$$\mathbb{P}\left[\frac{E_{\mathrm{out}}(g) - E_{\mathrm{in}}(g)}{\sqrt{E_{\mathrm{out}}(g)}} > \epsilon\right] \le c \cdot m_{\mathcal{H}}(2N) \exp\left(-\frac{\epsilon^2 N}{4}\right),$$

where c is a constant that is a little bigger than 6.

This bound is useful because sometimes what we care about is not the absolute
generalization error but instead a relative generalization error (one can imagine
that a generalization error of 0.01 is more significant when Eout = 0.01 than
when Eout = 0.5). Convert this to a generalization bound by showing that
with probability at least 1 − δ,

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \frac{\xi}{2}\left[1 + \sqrt{1 + \frac{4 E_{\mathrm{in}}(g)}{\xi}}\right],$$

where $\xi = \frac{4}{N} \log \frac{c \cdot m_{\mathcal{H}}(2N)}{\delta}$.

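Problem 2.22 When there is noise in the data, $E_{\text{out}}(g^{(\mathcal{D})}) = \mathbb{E}_{\mathbf{x},y}[(g^{(\mathcal{D})}(\mathbf{x}) - y(\mathbf{x}))^2]$,
where $y(\mathbf{x}) = f(\mathbf{x}) + \epsilon$. If $\epsilon$ is a zero-mean noise
random variable with variance $\sigma^2$, show that the bias-variance decomposition
becomes

$$\mathbb{E}_{\mathcal{D}}[E_{\text{out}}(g^{(\mathcal{D})})] = \sigma^2 + \text{bias} + \text{var}.$$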


Problem 2.23 Consider the learning problem in Example 2.8, where the
input space is $\mathcal{X} = [-1, +1]$, the target function is $f(x) = \sin(\pi x)$, and the
input probability distribution is uniform on $\mathcal{X}$. Assume that the training set $\mathcal{D}$
has only two data points (picked independently), and that the learning algorithm
picks the hypothesis that minimizes the in-sample mean squared error. In this
problem, we will dig deeper into this case.

For each of the following learning models, find (analytically or numerically)
(i) the best hypothesis that approximates f in the mean-squared-error sense 
(assume that f is known for this part), (ii) the expected value (with respect 
to D) of the hypothesis that the learning algorithm produces, and (iii) the 
expected out-of-sample error and its bias and var components. 


(a) The learning model consists of all hypotheses of the form h(x) = ax + b
(if you need to deal with the infinitesimal-probability case of two identical 
data points, choose the hypothesis tangential to f). 


(b) The learning model consists of all hypotheses of the form h(x) = ax.
This case was not covered in Example 2.8. 


(c) The learning model consists of all hypotheses of the form h(x) = b. 


Problem 2.24 Consider a simplified learning scenario. Assume that
the input dimension is one. Assume that the input variable x is uniformly
distributed in the interval [−1, 1]. The data set consists of 2 points $\{x_1, x_2\}$
and assume that the target function is $f(x) = x^2$. Thus, the full data set is
$\mathcal{D} = \{(x_1, x_1^2), (x_2, x_2^2)\}$. The learning algorithm returns the line fitting these
two points as g ($\mathcal{H}$ consists of functions of the form h(x) = ax + b). We are
interested in the test performance ($E_{\text{out}}$) of our learning system with respect
to the squared error measure, the bias and the var.

(a) Give the analytic expression for the average function $\bar{g}(x)$.

(b) Describe an experiment that you could run to determine (numerically)
$\bar{g}(x)$, $E_{\text{out}}$, bias, and var.

(c) Run your experiment and report the results. Compare $E_{\text{out}}$ with bias + var.
Provide a plot of your $\bar{g}(x)$ and $f(x)$ (on the same plot).

(d) Compute analytically what $E_{\text{out}}$, bias and var should be.
Chapter 3


The Linear Model 


We often wonder how to draw a line between two categories; right versus 
wrong, personal versus professional life, useful email versus spam, to name a 
few. A line is intuitively our first choice for a decision boundary. In learning, 
as in life, a line is also a good first choice. 

In Chapter 1, we (and the machine ☺) learned a procedure to ‘draw a line’
between two categories based on data (the perceptron learning algorithm). We
started by taking the hypothesis set $\mathcal{H}$ that included all possible lines (actually
hyperplanes). The algorithm then searched for a good line in $\mathcal{H}$ by iteratively
correcting the errors made by the current candidate line, in an attempt to
improve Ein. As we saw in Chapter 2, the linear model (the set of lines) has a
small VC dimension and so is able to generalize well from Ein to Eout.

The aim of this chapter is to further develop the basic linear model into a 
powerful tool for learning from data. We branch into three important prob- 
lems: the classification problem that we have seen and two other important 
problems called regression and probability estimation. The three problems 
come with different but related algorithms, and cover a lot of territory in 
learning from data. As a rule of thumb, when faced with learning problems, 
it is generally a winning strategy to try a linear model first. 


3.1 Linear Classification 


The linear model for classifying data into two classes uses a hypothesis set of 
linear classifiers, where each h has the form 


$$h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x}),$$

for some column vector $\mathbf{w} \in \mathbb{R}^{d+1}$, where d is the dimensionality of the input
space, and the added coordinate $x_0 = 1$ corresponds to the bias ‘weight’ $w_0$
(recall that the input space $\mathcal{X} = \{1\} \times \mathbb{R}^d$ is considered d-dimensional since
the added coordinate $x_0 = 1$ is fixed). We will use h and w interchangeably
to refer to the hypothesis when the context is clear. When we left Chapter 1,
we had two basic criteria for learning: 


1. Can we make sure that Eout(g) is close to Ein(g)? This ensures that what
we have learned in sample will generalize out of sample. 


2. Can we make Ein(g) small? This ensures that what we have learned in 
sample is a good hypothesis. 


The first criterion was studied in Chapter 2. Specifically, the VC dimension 
of the linear model is only d+ 1 (Exercise 2.4). Using the VC generalization 
bound (2.12), and the bound (2.10) on the growth function in terms of the 
VC dimension, we conclude that with high probability, 


$$E_{\text{out}}(g) = E_{\text{in}}(g) + O\left( \sqrt{\frac{d}{N} \ln N} \right). \tag{3.1}$$


Thus, when N is sufficiently large, Ein and Eout will be close to each other
(see the definition of O(·) in the Notation table), and the first criterion for
learning is fulfilled.

The second criterion, making sure that Ein is small, requires first and
foremost that there is some linear hypothesis that has small Ein. If there
isn’t such a linear hypothesis, then learning certainly can’t find one. So, let’s
suppose for the moment that there is a linear hypothesis with small Ein. In
fact, let’s suppose that the data is linearly separable, which means there is
some hypothesis w* with Ein(w*) = 0. We will deal with the case when this
is not true shortly.

In Chapter 1, we introduced the perceptron learning algorithm (PLA).
Start with an arbitrary weight vector w(0). Then, at every time step t ≥ 0,
select any misclassified data point (x(t), y(t)), and update w(t) as follows:

$$\mathbf{w}(t + 1) = \mathbf{w}(t) + y(t)\,\mathbf{x}(t).$$

The intuition is that the update is attempting to correct the error in classifying
x(t). The remarkable thing is that this incremental approach of learning
based on one data point at a time works. As discussed in Problem 1.3, it can be
proved that the PLA will eventually stop updating, ending at a solution $\mathbf{w}_{\text{PLA}}$
with $E_{\text{in}}(\mathbf{w}_{\text{PLA}}) = 0$. Although this result applies to a restricted setting (linearly
separable data), it is a significant step. The PLA is clever: it doesn’t
naively test every linear hypothesis to see if it (the hypothesis) separates the
data; that would take infinitely long. Using an iterative approach, the PLA
manages to search an infinite hypothesis set and output a linear separator in
(provably) finite time.
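
The update rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names (`sign`, `dot`, `pla`), the zero starting vector, the convention sign(0) = −1, the random choice among misclassified points, and the toy data set are all assumptions made for the example.

```python
import random

def sign(s):
    """Return +1 for positive s and -1 otherwise (sign(0) = -1 is an assumed convention)."""
    return 1 if s > 0 else -1

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pla(X, y, max_updates=10_000, seed=0):
    """Perceptron learning algorithm.

    X: list of inputs, each already augmented with x0 = 1.
    y: list of labels in {-1, +1}.
    Returns (w, number_of_updates); stops as soon as no point is misclassified.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])                                   # w(0): arbitrary start
    for t in range(max_updates):
        wrong = [n for n in range(len(X)) if sign(dot(w, X[n])) != y[n]]
        if not wrong:                                       # E_in(w) = 0
            return w, t
        n = rng.choice(wrong)                               # any misclassified point
        w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]     # w(t+1) = w(t) + y(t) x(t)
    return w, max_updates

# Hypothetical toy data, linearly separable by the line x1 = x2
X = [[1, 2.0, 0.5], [1, 1.5, -1.0], [1, 3.0, 1.0],
     [1, 0.5, 2.0], [1, -1.0, 1.5], [1, 1.0, 3.0]]
y = [+1, +1, +1, -1, -1, -1]
w_pla, updates = pla(X, y)
```

The `max_updates` cap is a safeguard only; on separable data the loop exits through the `if not wrong` branch.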

As far as PLA is concerned, linear separability is a property of the data, 
not the target. A linearly separable D could have been generated either from 
a linearly separable target, or (by chance) from a target that is not linearly 
separable. The convergence proof of PLA guarantees that the algorithm will
work in both these cases, and produce a hypothesis with Ein = 0. Further,
in both cases, you can be confident that this performance will generalize well
out of sample, according to the VC bound.

(a) Few noisy data. (b) Nonlinearly separable.

Figure 3.1: Data sets that are not linearly separable but are (a) linearly
separable after discarding a few examples, or (b) separable by a more
sophisticated curve.


Exercise 3.1 
Will PLA ever stop updating if the data is not linearly separable? 


3.1.1 Non-Separable Data 


We now address the case where the data is not linearly separable. Figure 3.1 
shows two data sets that are not linearly separable. In Figure 3.1(a), the data 
becomes linearly separable after the removal of just two examples, which could 
be considered noisy examples or outliers. In Figure 3.1(b), the data can be 
separated by a circle rather than a line. In both cases, there will always be 
a misclassified training example if we insist on using a linear hypothesis, and 
hence PLA will never terminate. In fact, its behavior becomes quite unstable, 
and can jump from a good perceptron to a very bad one within one update; the 
quality of the resulting Ein cannot be guaranteed. In Figure 3.1(a), it seems
appropriate to stick with a line, but to somehow tolerate noise and output a
hypothesis with a small Ein, not necessarily Ein = 0. In Figure 3.1(b), the
linear model does not seem to be the correct model in the first place, and we
will discuss a technique called nonlinear transformation for this situation in
Section 3.4.
The situation in Figure 3.1(a) is actually encountered very often: even
though a linear classifier seems appropriate, the data may not be linearly
separable because of outliers or noise. To find a hypothesis with the minimum Ein,
we need to solve the combinatorial optimization problem:

$$\min_{\mathbf{w} \in \mathbb{R}^{d+1}} \ \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \mathrm{sign}(\mathbf{w}^T \mathbf{x}_n) \ne y_n ]\!]}_{E_{\text{in}}(\mathbf{w})}. \tag{3.2}$$


The difficulty in solving this problem arises from the discrete nature of both
sign(·) and ⟦·⟧. In fact, minimizing Ein(w) in (3.2) in the general case is known
to be NP-hard, which means there is no known efficient algorithm for it, and
if you discovered one, you would become really, really famous ☺. Thus, one
has to resort to approximately minimizing Ein.

One approach for getting an approximate solution is to extend PLA through 
a simple modification into what is called the pocket algorithm. Essentially, the 
pocket algorithm keeps ‘in its pocket’ the best weight vector encountered up 
to iteration t in PLA. At the end, the best weight vector will be reported as 
the final hypothesis. This simple algorithm is shown below. 





The pocket algorithm:
1: Set the pocket weight vector $\hat{\mathbf{w}}$ to w(0) of PLA.
2: for t = 0, ..., T − 1 do
3:     Run PLA for one update to obtain w(t + 1).
4:     Evaluate Ein(w(t + 1)).
5:     If w(t + 1) is better than $\hat{\mathbf{w}}$ in terms of Ein, set $\hat{\mathbf{w}}$ to w(t + 1).
6: Return $\hat{\mathbf{w}}$.





The original PLA only checks some of the examples using w(t) to identify 
(x(t), y(t)) in each iteration, while the pocket algorithm needs an additional 
step that evaluates all examples using w(t + 1) to get Ein(w(t+1)). The 
additional step makes the pocket algorithm much slower than PLA. In addi- 
tion, there is no guarantee for how fast the pocket algorithm can converge to a 
good Ein. Nevertheless, it is a useful algorithm to have on hand because of its 
simplicity. Other, more efficient approaches for obtaining good approximate 
solutions have been developed based on different optimization techniques, as 
shown later in this chapter. 
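
As a rough sketch of the pocket algorithm, the Python below wraps one PLA update per iteration and keeps the best weights seen so far. The function names, the zero initial weights, the convention sign(0) = −1, and the seven-point data set (a separable set plus one deliberately mislabeled point) are assumptions made for illustration.

```python
import random

def sign(s):
    return 1 if s > 0 else -1          # sign(0) = -1 is an assumed convention

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def e_in(w, X, y):
    """Fraction of misclassified examples, as in (3.2)."""
    return sum(sign(dot(w, xn)) != yn for xn, yn in zip(X, y)) / len(X)

def pocket(X, y, T=1000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])                        # w(0) of PLA
    w_hat, best = list(w), e_in(w, X, y)         # pocket weights and their E_in
    for _ in range(T):
        wrong = [n for n in range(len(X)) if sign(dot(w, X[n])) != y[n]]
        if not wrong:                            # nothing left to correct
            break
        n = rng.choice(wrong)
        w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]   # one PLA update
        err = e_in(w, X, y)                      # the extra pass over all N examples
        if err < best:                           # keep the better vector in the pocket
            w_hat, best = list(w), err
    return w_hat, best

# Hypothetical data: the last point is labeled -1 but lies inside the triangle
# formed by the three +1 points, so no line can separate the two classes.
X = [[1, 2.0, 0.5], [1, 1.5, -1.0], [1, 3.0, 1.0],
     [1, 0.5, 2.0], [1, -1.0, 1.5], [1, 1.0, 3.0], [1, 2.2, 0.2]]
y = [+1, +1, +1, -1, -1, -1, -1]
w_hat, err = pocket(X, y, T=200)
```

The call to `e_in` inside the loop is the additional full pass over the data that makes pocket slower than plain PLA.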


Exercise 3.2 


Take d = 2 and create a data set $\mathcal{D}$ of size N = 100 that is not linearly
separable. You can do so by first choosing a random line in the plane as
your target function and the inputs $\mathbf{x}_n$ of the data set as random points
in the plane. Then, evaluate the target function on each $\mathbf{x}_n$ to get the
corresponding output $y_n$. Finally, flip the labels of N/10 randomly selected
$y_n$’s and the data set will likely become non-separable.
Now, try the pocket algorithm on your data set using T = 1,000 iterations.
Repeat the experiment 20 times. Then, plot the average $E_{\text{in}}(\mathbf{w}(t))$ and the
average $E_{\text{in}}(\hat{\mathbf{w}})$ (which is also a function of t) on the same figure and see
how they behave when t increases. Similarly, use a test set of size 1,000
and plot a figure to show how $E_{\text{out}}(\mathbf{w}(t))$ and $E_{\text{out}}(\hat{\mathbf{w}})$ behave.


Example 3.1 (Handwritten digit recognition). We sample some digits from 
the US Postal Service Zip Code Database. These 16 x 16 pixel images are 
preprocessed from the scanned handwritten zip codes. The goal is to recognize 
the digit in each image. We alluded to this task in part (b) of Exercise 1.1. 
A quick look at the images reveals that this is a non-trivial task (even for a 
human), and typical human Eout is about 2.5%. Common confusion occurs 
between the digits {4,9} and {2,7}. A machine-learned hypothesis which can 
achieve such an error rate would be highly desirable. 
Let’s first decompose the big task of separating ten digits into smaller tasks of
separating two of the digits. Such a decomposition approach from multiclass 
to binary classification is commonly used in many learning algorithms. We will 
focus on digits {1,5} for now. A human approach to determining the digit 
corresponding to an image is to look at the shape (or other properties) of the 
black pixels. Thus, rather than carrying all the information in the 256 pixels, 
it makes sense to summarize the information contained in the image into a few 
features. Let’s look at two important features here: intensity and symmetry. 
Digit 5 usually occupies more black pixels than digit 1, and hence the average 
pixel intensity of digit 5 is higher. On the other hand, digit 1 is symmetric 
while digit 5 is not. Therefore, if we define asymmetry as the average absolute 
difference between an image and its flipped versions, and symmetry as the 
negation of asymmetry, digit 1 would result in a higher symmetry value. A 
scatter plot for these intensity and symmetry features for some of the digits is 
shown next. 
While the digits can be roughly separated by a line in the plane representing
these two features, there are poorly written digits (such as the ‘5’ depicted in 
the top-left corner) that prevent a perfect linear separation. 
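
To make the two features concrete, here is a small sketch that computes them for an image stored as a list of pixel rows. It assumes pixel values where larger means blacker and uses only the left-right mirror image as the 'flipped version'; the function names and the tiny 4 x 4 test images are hypothetical and are not taken from the zip code data.

```python
def intensity(img):
    """Average pixel value of the image (img is a list of equal-length rows)."""
    return sum(sum(row) for row in img) / (len(img) * len(img[0]))

def symmetry(img):
    """Negated average absolute difference between img and its left-right mirror."""
    total = 0.0
    for row in img:
        mirrored = row[::-1]
        total += sum(abs(a - b) for a, b in zip(row, mirrored))
    asymmetry = total / (len(img) * len(img[0]))
    return -asymmetry

# Two tiny 4 x 4 'images' (1 = black pixel): a centered vertical bar and an L shape
bar = [[0, 1, 1, 0]] * 4
ell = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
features = {name: (intensity(img), symmetry(img))
            for name, img in [("bar", bar), ("ell", ell)]}
```

The mirror-symmetric bar gets the highest possible symmetry value (zero), while the lopsided L shape gets a negative one, which is the kind of contrast the text describes between digits 1 and 5.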

We now run PLA and pocket on the data set and see what happens. Since 
the data set is not linearly separable, PLA will not stop updating. In fact, 
as can be seen in Figure 3.2(a), its behavior can be quite unstable. When 
it is forcibly terminated at iteration 1,000, PLA gives a line that has a poor 
Ein = 2.24% and Eout = 6.37%. On the other hand, if the pocket algorithm is 
applied to the same data set, as shown in Figure 3.2(b), we can obtain a line 
that has a better Ein = 0.45% and a better Eout = 1.89%. □


3.2 Linear Regression 


Linear regression is another useful linear model that applies to real-valued 
target functions.¹ It has a long history in statistics, where it has been studied
in great detail, and has various applications in social and behavioral sciences. 
Here, we discuss linear regression from a learning perspective, where we derive 
the main results with minimal assumptions. 

Let us revisit our application in credit approval, this time considering a 
regression problem rather than a classification problem. Recall that the bank 
has customer records that contain information fields related to personal credit, 
such as annual salary, years in residence, outstanding loans, etc. Such variables 
can be used to learn a linear classifier to decide on credit approval. Instead of 
just making a binary decision (approve or not), the bank also wants to set a 
proper credit limit for each approved customer. Credit limits are traditionally 
determined by human experts. The bank wants to automate this task, as it 
did with credit approval. 


¹ Regression, a term inherited from earlier work in statistics, means y is real-valued.
[Figure panels: (a) PLA, (b) Pocket. Upper plots: error (log scale) versus iteration number t, from 0 to 1000. Lower plots: learned hypothesis, with symmetry on the vertical axis.]


Figure 3.2: Comparison of two linear classification algorithms for separating
digits 1 and 5. Ein and Eout are plotted versus iteration number
and below that is the learned hypothesis g. (a) A version of the PLA which
selects a random training example and updates w if that example is misclassified
(hence the flat regions when no update is made). This version avoids
searching all the data at every iteration. (b) The pocket algorithm.


This is a regression learning problem. The bank uses historical records to
construct a data set $\mathcal{D}$ of examples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$, where $\mathbf{x}_n$
is customer information and $y_n$ is the credit limit set by one of the human
experts in the bank. Note that $y_n$ is now a real number (positive in this case)
instead of just a binary value ±1. The bank wants to use learning to find a
hypothesis g that replicates how human experts determine credit limits.


Since there is more than one human expert, and since each expert may 
not be perfectly consistent, our target will not be a deterministic function
y = f(x). Instead, it will be a noisy target formalized as a distribution of the 
random variable y that comes from the different views of different experts as 
well as the variation within the views of each expert. That is, the label yn 
comes from some distribution P(y | x) instead of a deterministic function f(x). 
Nonetheless, as we discussed in previous chapters, the nature of the problem 
is not changed. We have an unknown distribution P(x,y) that generates 
each $(\mathbf{x}_n, y_n)$, and we want to find a hypothesis g that minimizes the error
between g(x) and y with respect to that distribution. 

The choice of a linear model for this problem presumes that there is a linear 
combination of the customer information fields that would properly approx- 
imate the credit limit as determined by human experts. If this assumption 
does not hold, we cannot achieve a small error with a linear model. We will 
deal with this situation when we discuss nonlinear transformation later in the 
chapter. 


3.2.1 The Algorithm 


The linear regression algorithm is based on minimizing the squared error
between h(x) and y,²

$$E_{\text{out}}(h) = \mathbb{E}\left[ (h(\mathbf{x}) - y)^2 \right],$$

where the expected value is taken with respect to the joint probability
distribution P(x, y). The goal is to find a hypothesis that achieves a small Eout(h).
Since the distribution P(x, y) is unknown, Eout(h) cannot be computed. Similar
to what we did in classification, we resort to the in-sample version instead,

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(\mathbf{x}_n) - y_n)^2.$$


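In linear regression, h takes the form of a linear combination of the components
of x. That is,

$$h(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \mathbf{x},$$

where $x_0 = 1$ and $\mathbf{x} \in \{1\} \times \mathbb{R}^d$ as usual, and $\mathbf{w} \in \mathbb{R}^{d+1}$. For the special case
of linear h, it is very useful to have a matrix representation of $E_{\text{in}}(h)$. First,
define the data matrix $X \in \mathbb{R}^{N \times (d+1)}$ to be the N × (d+1) matrix whose rows
are the inputs $\mathbf{x}_n$ as row vectors, and define the target vector $\mathbf{y} \in \mathbb{R}^N$ to be
the column vector whose components are the target values $y_n$. The in-sample
error is a function of w and the data X, y:

\begin{align}
E_{\text{in}}(\mathbf{w}) &= \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - y_n)^2 \notag \\
&= \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2 \tag{3.3} \\
&= \frac{1}{N} \left( \mathbf{w}^T X^T X \mathbf{w} - 2\mathbf{w}^T X^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \right), \tag{3.4}
\end{align}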

where $\| \cdot \|$ is the Euclidean norm of a vector, and (3.3) follows because the nth
component of the vector $X\mathbf{w} - \mathbf{y}$ is exactly $\mathbf{w}^T \mathbf{x}_n - y_n$. The linear regression
algorithm is derived by minimizing $E_{\text{in}}(\mathbf{w})$ over all possible $\mathbf{w} \in \mathbb{R}^{d+1}$, as
formalized by the following optimization problem:

$$\mathbf{w}_{\text{lin}} = \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^{d+1}} E_{\text{in}}(\mathbf{w}). \tag{3.5}$$

(a) one dimension (line) (b) two dimensions (hyperplane)

Figure 3.3: The solution hypothesis (in blue) of the linear regression
algorithm in one and two dimensions. The sum of squared errors is minimized.

² The term ‘linear regression’ has been historically confined to squared error measures.


Figure 3.3 illustrates the solution in one and two dimensions. Since
Equation (3.4) implies that $E_{\text{in}}(\mathbf{w})$ is differentiable, we can use standard matrix
calculus to find the w that minimizes $E_{\text{in}}(\mathbf{w})$ by requiring that the gradient
of $E_{\text{in}}$ with respect to w is the zero vector, i.e., $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$. The gradient is
a (column) vector whose ith component is $[\nabla E_{\text{in}}(\mathbf{w})]_i = \frac{\partial}{\partial w_i} E_{\text{in}}(\mathbf{w})$. By
explicitly computing $\frac{\partial}{\partial w_i}$, the reader can verify the following gradient identities,

$$\nabla_{\mathbf{w}}(\mathbf{w}^T A \mathbf{w}) = (A + A^T)\mathbf{w}, \qquad \nabla_{\mathbf{w}}(\mathbf{w}^T \mathbf{b}) = \mathbf{b}.$$

These identities are the matrix analog of ordinary differentiation of quadratic
and linear functions. To obtain the gradient of $E_{\text{in}}$, we take the gradient of
each term in (3.4) to obtain

$$\nabla E_{\text{in}}(\mathbf{w}) = \frac{2}{N} \left( X^T X \mathbf{w} - X^T \mathbf{y} \right).$$

Note that both w and $\nabla E_{\text{in}}(\mathbf{w})$ are column vectors. Finally, to get $\nabla E_{\text{in}}(\mathbf{w})$
to be $\mathbf{0}$, one should solve for w that satisfies

$$X^T X \mathbf{w} = X^T \mathbf{y}.$$

If $X^T X$ is invertible, $\mathbf{w} = X^\dagger \mathbf{y}$ where $X^\dagger = (X^T X)^{-1} X^T$ is the pseudo-inverse
of X. The resulting w is the unique optimal solution to (3.5). If $X^T X$ is not
invertible, a pseudo-inverse can still be defined, but the solution will not be
unique (see Problem 3.15). In practice, $X^T X$ is invertible in most of the cases
since N is often much bigger than d + 1, so there will likely be d + 1 linearly
independent vectors $\mathbf{x}_n$. We have thus derived the following linear regression
algorithm.












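Linear regression algorithm:

1: Construct the matrix X and the vector y from the data set
   $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, where each x includes the $x_0 = 1$
   bias coordinate, as follows

$$X = \underbrace{\begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}}_{\text{input data matrix}}, \qquad \mathbf{y} = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{\text{target vector}}.$$

2: Compute the pseudo-inverse $X^\dagger$ of the matrix X. If $X^T X$
   is invertible,

$$X^\dagger = (X^T X)^{-1} X^T.$$

3: Return $\mathbf{w}_{\text{lin}} = X^\dagger \mathbf{y}$.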


This algorithm is sometimes referred to as ordinary least squares (OLS). It may
seem that, compared with the perceptron learning algorithm, linear regression
doesn’t really look like ‘learning’, in the sense that the hypothesis $\mathbf{w}_{\text{lin}}$ comes
from an analytic solution (matrix inversion and multiplications) rather than
from iterative learning steps. Well, as long as the hypothesis $\mathbf{w}_{\text{lin}}$ has a decent
out-of-sample error, then learning has occurred. Linear regression is a rare
case where we have an analytic formula for learning that is easy to evaluate.
This is one of the reasons why the technique is so widely used. It should
be noted that there are methods for computing the pseudo-inverse directly
without inverting a matrix, and that these methods are numerically more
stable than matrix inversion.
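
The three steps of the algorithm can be sketched with NumPy as follows. The helper name `linear_regression` and the noiseless toy target are assumptions made for illustration; `np.linalg.pinv` computes the pseudo-inverse through a singular value decomposition, so no explicit matrix inverse is formed.

```python
import numpy as np

def linear_regression(X_raw, y):
    """Return (X, w_lin), where X is X_raw with the x0 = 1 column prepended
    and w_lin = pinv(X) @ y."""
    X_raw = np.asarray(X_raw, dtype=float)
    X = np.column_stack([np.ones(len(X_raw)), X_raw])   # step 1: N x (d+1) matrix
    X_dagger = np.linalg.pinv(X)                         # step 2: pseudo-inverse
    w_lin = X_dagger @ np.asarray(y, dtype=float)        # step 3: w_lin = X^dagger y
    return X, w_lin

# Hypothetical noiseless data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(20, 2))
y = 1 + 2 * X_raw[:, 0] - 3 * X_raw[:, 1]

X, w_lin = linear_regression(X_raw, y)
e_in = np.mean((X @ w_lin - y) ** 2)     # in-sample squared error, as in (3.3)
```

Because the toy target is exactly linear and noise-free, the recovered weights should match (1, 2, −3) up to rounding and the in-sample error should be essentially zero.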

Linear regression has been analyzed in great detail in statistics. We would 
like to mention one of the analysis tools here since it relates to in-sample and 
out-of-sample errors, and that is the hat matrix H. Here is how H is defined.
The linear regression weight vector $\mathbf{w}_{\text{lin}}$ is an attempt to map the inputs X
to the outputs y. However, $\mathbf{w}_{\text{lin}}$ does not produce y exactly, but produces an
estimate

$$\hat{\mathbf{y}} = X \mathbf{w}_{\text{lin}},$$

which differs from y due to in-sample error. Substituting the expression
for $\mathbf{w}_{\text{lin}}$ (assuming $X^T X$ is invertible), we get

$$\hat{\mathbf{y}} = X (X^T X)^{-1} X^T \mathbf{y}.$$

Therefore the estimate $\hat{\mathbf{y}}$ is a linear transformation of the actual y through
matrix multiplication with H, where

$$H = X (X^T X)^{-1} X^T. \tag{3.6}$$

Since $\hat{\mathbf{y}} = H\mathbf{y}$, the matrix H ‘puts a hat’ on y, hence the name. The hat
matrix is a very special matrix. For one thing, $H^2 = H$, which can be verified
using the above expression for H. This and other properties of H will facilitate
the analysis of in-sample and out-of-sample errors of linear regression.
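
A quick numerical check of these statements, as a sketch on randomly generated data (the sizes N = 8, d = 2 and the random seed are arbitrary choices, and a numerical check is of course not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 8, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])   # N x (d+1), with x0 = 1
y = rng.normal(size=N)

# Hat matrix (3.6); forming the inverse explicitly is fine for a tiny example
H = X @ np.linalg.inv(X.T @ X) @ X.T

w_lin = np.linalg.pinv(X) @ y
y_hat = X @ w_lin

same_estimate = np.allclose(H @ y, y_hat)      # y_hat = H y
idempotent = np.allclose(H @ H, H)             # H^2 = H
trace_matches = np.isclose(np.trace(H), d + 1) # trace(H) = d + 1
```

All three flags should come out true for any X whose columns are linearly independent.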


Exercise 3.3 
Consider the hat matrix $H = X (X^T X)^{-1} X^T$, where X is an N by d + 1
matrix, and $X^T X$ is invertible.
(a) Show that H is symmetric.
(b) Show that $H^K = H$ for any positive integer K.
(c) If I is the identity matrix of size N, show that $(I - H)^K = I - H$ for
any positive integer K.


(d) Show that trace(H) = d+ 1, where the trace is the sum of diagonal 
elements. [Hint: trace(AB) = trace(BA).] 


3.2.2 Generalization Issues 


Linear regression looks for the optimal weight vector in terms of the in-sample
error Ein, which leads to the usual generalization question: Does this guarantee
decent out-of-sample error Eout? The short answer is yes. There is a regression
version of the VC generalization bound (3.1) that similarly bounds Eout. In
the case of linear regression in particular, there are also exact formulas for
the expected Eout and Ein that can be derived under simplifying assumptions.
The general form of the result is

$$E_{\text{out}}(g) = E_{\text{in}}(g) + O\left( \frac{d}{N} \right),$$

where Eout(g) and Ein(g) are the expected values. This is comparable to the
classification bound in (3.1).


Exercise 3.4 


Consider a noisy target $y = \mathbf{w}^{*T} \mathbf{x} + \epsilon$ for generating the data, where $\epsilon$ is
a noise term with zero mean and $\sigma^2$ variance, independently generated for
every example (x, y). The expected error of the best possible linear fit to
this target is thus $\sigma^2$.

For the data $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, denote the noise in $y_n$ as $\epsilon_n$
and let $\boldsymbol{\epsilon} = [\epsilon_1, \epsilon_2, \ldots, \epsilon_N]^T$; assume that $X^T X$ is invertible. By following
the steps below, show that the expected in-sample error of linear regression
with respect to $\mathcal{D}$ is given by

$$\mathbb{E}_{\mathcal{D}}[E_{\text{in}}(\mathbf{w}_{\text{lin}})] = \sigma^2 \left( 1 - \frac{d+1}{N} \right).$$

(a) Show that the in-sample estimate of y is given by $\hat{\mathbf{y}} = X\mathbf{w}^* + H\boldsymbol{\epsilon}$.

(b) Show that the in-sample error vector $\hat{\mathbf{y}} - \mathbf{y}$ can be expressed by a
matrix times $\boldsymbol{\epsilon}$. What is the matrix?

(c) Express $E_{\text{in}}(\mathbf{w}_{\text{lin}})$ in terms of $\boldsymbol{\epsilon}$ using (b), and simplify the expression
using Exercise 3.3(c).

(d) Prove that $\mathbb{E}_{\mathcal{D}}[E_{\text{in}}(\mathbf{w}_{\text{lin}})] = \sigma^2 \left( 1 - \frac{d+1}{N} \right)$ using (c) and the independence
of $\epsilon_1, \ldots, \epsilon_N$. [Hint: The sum of the diagonal elements of a
matrix (the trace) will play a role. See Exercise 3.3(d).]

For the expected out-of-sample error, we take a special case which is easy to
analyze. Consider a test data set $\mathcal{D}_{\text{test}} = \{(\mathbf{x}_1, y'_1), \ldots, (\mathbf{x}_N, y'_N)\}$, which
shares the same input vectors $\mathbf{x}_n$ with $\mathcal{D}$ but with a different realization of
the noise terms. Denote the noise in $y'_n$ as $\epsilon'_n$ and let $\boldsymbol{\epsilon}' = [\epsilon'_1, \epsilon'_2, \ldots, \epsilon'_N]^T$.
Define $E_{\text{test}}(\mathbf{w}_{\text{lin}})$ to be the average squared error on $\mathcal{D}_{\text{test}}$.

(e) Prove that $\mathbb{E}_{\mathcal{D}, \boldsymbol{\epsilon}'}[E_{\text{test}}(\mathbf{w}_{\text{lin}})] = \sigma^2 \left( 1 + \frac{d+1}{N} \right)$.

The special test error $E_{\text{test}}$ is a very restricted case of the general
out-of-sample error. Some detailed analysis shows that similar results can be
obtained for the general case, as shown in Problem 3.11.


Figure 3.4 illustrates the learning curve of linear regression under the
assumptions of Exercise 3.4. The best possible linear fit has expected error $\sigma^2$. The
expected in-sample error is smaller, equal to $\sigma^2 \left( 1 - \frac{d+1}{N} \right)$ for $N \ge d + 1$. The
learned linear fit has eaten into the in-sample noise as much as it could with
the d + 1 degrees of freedom that it has at its disposal. This occurs because
the fitting cannot distinguish the noise from the ‘signal.’ On the other hand,
the expected out-of-sample error is $\sigma^2 \left( 1 + \frac{d+1}{N} \right)$, which is more than the
unavoidable error of $\sigma^2$. The additional error reflects the drift in $\mathbf{w}_{\text{lin}}$ due to
fitting the in-sample noise.
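
These two formulas can be checked by simulation. The sketch below follows the setup of Exercise 3.4 (same inputs for training and test, fresh noise for the test labels); the Gaussian inputs and noise, the values N = 20, d = 4, σ = 0.5, the number of trials and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def average_errors(N=20, d=4, sigma=0.5, trials=4000, seed=0):
    """Monte Carlo estimates of E[E_in(w_lin)] and E[E_test(w_lin)]."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d + 1)                   # hypothetical target weights
    e_in = e_test = 0.0
    for _ in range(trials):
        X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
        signal = X @ w_star
        y = signal + sigma * rng.normal(size=N)       # training labels
        y_test = signal + sigma * rng.normal(size=N)  # same inputs, fresh noise
        w_lin = np.linalg.pinv(X) @ y
        e_in += np.mean((X @ w_lin - y) ** 2)
        e_test += np.mean((X @ w_lin - y_test) ** 2)
    return e_in / trials, e_test / trials

N, d, sigma = 20, 4, 0.5
e_in, e_test = average_errors(N, d, sigma)
predicted_in = sigma**2 * (1 - (d + 1) / N)      # 0.1875
predicted_test = sigma**2 * (1 + (d + 1) / N)    # 0.3125
```

With a few thousand trials the simulated averages should land close to the predicted values, with the in-sample average below σ² = 0.25 and the test average above it.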


3.3 Logistic Regression 


The core of the linear model is the ‘signal’ $s = \mathbf{w}^T \mathbf{x}$ that combines the input
variables linearly. We have seen two models based on this signal, and we are
now going to introduce a third. In linear regression, the signal itself is taken
as the output, which is appropriate if you are trying to predict a real response
that could be unbounded. In linear classification, the signal is thresholded
at zero to produce a ±1 output, appropriate for binary decisions. A third
possibility, which has wide application in practice, is to output a probability,
a value between 0 and 1. Our new model is called logistic regression. It has
similarities to both previous models, as the output is real (like regression) but
bounded (like classification).

Figure 3.4: The learning curve for linear regression. [Axes: expected error
versus number of data points N, with d + 1 marked on the horizontal axis.]


Example 3.2 (Prediction of heart attacks). Suppose we want to predict the 
occurrence of heart attacks based on a person’s cholesterol level, blood pres- 
sure, age, weight, and other factors. Obviously, we cannot predict a heart 
attack with any certainty, but we may be able to predict how likely it is to 
occur given these factors. Therefore, an output that varies continuously be- 
tween 0 and 1 would be a more suitable model than a binary decision. The 
closer y is to 1, the more likely that the person will have a heart attack. □


3.3.1 Predicting a Probability 
Linear classification uses a hard threshold on the signal $s = \mathbf{w}^T \mathbf{x}$,

$$h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x}),$$

while linear regression uses no threshold at all,

$$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}.$$

In our new model, we need something in between these two cases that smoothly
restricts the output to the probability range [0, 1]. One choice that
accomplishes this goal is the logistic regression model,

$$h(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}),$$

where $\theta$ is the so-called logistic function $\theta(s) = \frac{e^s}{1 + e^s}$ whose output is between 0
and 1.

The output can be interpreted as a probability for a binary event (heart attack
or no heart attack, digit ‘1’ versus digit ‘5’, etc.). Linear classification also
deals with a binary event, but the difference is that the ‘classification’ in
logistic regression is allowed to be uncertain, with intermediate values between
0 and 1 reflecting this uncertainty. The logistic function $\theta$ is referred to as a
soft threshold, in contrast to the hard threshold in classification. It is also
called a sigmoid because its shape looks like a flattened out ‘s’.











Exercise 3.5 
Another popular soft threshold is the hyperbolic tangent

$$\tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}}.$$

(a) How is tanh related to the logistic function $\theta$? [Hint: shift and scale]

(b) Show that tanh(s) converges to a hard threshold for large |s|, and
converges to no threshold for small |s|. [Hint: Formalize the figure
below.]





The specific formula of $\theta(s)$ will allow us to define an error measure for learning
that has analytical and computational advantages, as we will see shortly. Let
us first look at the target that logistic regression is trying to learn. The target
is a probability, say of a patient being at risk for heart attack, that depends
on the input x (the characteristics of the patient). Formally, we are trying to
learn the target function

$$f(\mathbf{x}) = \mathbb{P}[y = +1 \mid \mathbf{x}].$$

The data does not give us the value of f explicitly. Rather, it gives us samples
generated by this probability, e.g., patients who had heart attacks and patients
who didn’t. Therefore, the data is in fact generated by a noisy target P(y | x),

$$P(y \mid \mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{for } y = +1; \\ 1 - f(\mathbf{x}) & \text{for } y = -1. \end{cases} \tag{3.7}$$

To learn from such data, we need to define a proper error measure that gauges
how close a given hypothesis h is to f in terms of these noisy ±1 examples.
Error measure. The standard error measure e(h(x), y) used in logistic
regression is based on the notion of likelihood: how ‘likely’ is it that we would get
this output y from the input x if the target distribution P(y | x) was indeed
captured by our hypothesis h(x)? Based on (3.7), that likelihood would be

$$P(y \mid \mathbf{x}) = \begin{cases} h(\mathbf{x}) & \text{for } y = +1; \\ 1 - h(\mathbf{x}) & \text{for } y = -1. \end{cases}$$

We substitute for h(x) by its value $\theta(\mathbf{w}^T \mathbf{x})$, and use the fact that $1 - \theta(s) = \theta(-s)$
(easy to verify) to get

$$P(y \mid \mathbf{x}) = \theta(y \, \mathbf{w}^T \mathbf{x}). \tag{3.8}$$

One of our reasons for choosing the mathematical form $\theta(s) = e^s / (1 + e^s)$ is
that it leads to this simple expression for P(y | x).
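
The logistic function and the likelihood in (3.8) are easy to try out numerically. In the sketch below, the two-branch evaluation of θ (to avoid overflow in the exponential for large |s|) and the names `theta` and `likelihood` are implementation choices made for this example.

```python
import math

def theta(s):
    """Logistic function e^s / (1 + e^s), evaluated without overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))      # same value, rewritten for s >= 0
    e = math.exp(s)
    return e / (1.0 + e)

def likelihood(w, x, y):
    """P(y | x) = theta(y * w^T x) for a label y in {-1, +1}, as in (3.8)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return theta(y * s)

# Hypothetical weights and input (x includes the x0 = 1 coordinate)
w = [0.5, -1.0, 2.0]
x = [1.0, 0.3, 0.8]
p_plus = likelihood(w, x, +1)
p_minus = likelihood(w, x, -1)
```

Since 1 − θ(s) = θ(−s), the two likelihoods `p_plus` and `p_minus` add up to one for any w and x.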

Since the data points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ are independently generated,
the probability of getting all the $y_n$’s in the data set from the corresponding
$\mathbf{x}_n$’s would be the product

$$\prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n).$$


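The method of maximum likelihood selects the hypothesis $h$ which maximizes
this probability.³ We can equivalently minimize a more convenient quantity,

$$-\frac{1}{N} \ln\left( \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) \right) = \frac{1}{N} \sum_{n=1}^{N} \ln\left( \frac{1}{P(y_n \mid \mathbf{x}_n)} \right),$$

since ‘$-\frac{1}{N}\ln(\cdot)$’ is a monotonically decreasing function. Substituting with
Equation (3.8), we would be minimizing

$$\frac{1}{N} \sum_{n=1}^{N} \ln\left( \frac{1}{\theta(y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n)} \right)$$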


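with respect to the weight vector $\mathbf{w}$. The fact that we are minimizing this
quantity allows us to treat it as an ‘error measure.’ Substituting the functional
form for $\theta(y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n)$ produces the in-sample error measure for logistic
regression,

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left( 1 + e^{-y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n} \right). \qquad (3.9)$$

The implied pointwise error measure is $e(h(\mathbf{x}_n), y_n) = \ln(1 + e^{-y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n})$. Notice
that this error measure is small when $y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n$ is large and positive, which
would imply that $\operatorname{sign}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n) = y_n$. Therefore, as our intuition would expect,
the error measure encourages $\mathbf{w}$ to ‘classify’ each $\mathbf{x}_n$ correctly.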
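As a concrete check of (3.9), the following is a minimal sketch that evaluates the in-sample error with NumPy. The function names and the two-point toy data set are assumptions made for illustration; they do not come from the text.

```python
import numpy as np

def logistic(s):
    """Logistic function theta(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_error(w, X, y):
    """In-sample error (3.9): mean of ln(1 + exp(-y_n w^T x_n)).

    X is an N x (d+1) array whose first column is the constant 1,
    y is a length-N array of +1/-1 labels.
    """
    signal = y * (X @ w)
    # logaddexp(0, -s) = ln(e^0 + e^{-s}) = ln(1 + e^{-s}), without overflow
    return np.mean(np.logaddexp(0.0, -signal))

# Hypothetical toy data: one point on each side of the origin.
X_toy = np.array([[1.0, 2.0], [1.0, -2.0]])
y_toy = np.array([1.0, -1.0])

w_good = np.array([0.0, 3.0])   # classifies both toy points correctly
e_zero = cross_entropy_error(np.zeros(2), X_toy, y_toy)   # w = 0: each point costs ln 2
e_good = cross_entropy_error(w_good, X_toy, y_toy)        # much smaller error
```

With $\mathbf{w} = \mathbf{0}$ every point contributes $\ln 2$, and the error shrinks as $y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n$ becomes large and positive, in line with the remark above.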


3 Although the method of maximum likelihood is intuitively plausible, its rigorous justi-
fication as an inference tool continues to be discussed in the statistics community.

Exercise 3.6 [Cross-entropy error measure]


(a) More generally, if we are learning from $\pm 1$ data to predict a noisy
target $P(y \mid \mathbf{x})$ with candidate hypothesis $h$, show that the maximum
likelihood method reduces to the task of finding $h$ that minimizes

$$E_{\text{in}}(\mathbf{w}) = \sum_{n=1}^{N} [\![ y_n = +1 ]\!] \ln\frac{1}{h(\mathbf{x}_n)} + [\![ y_n = -1 ]\!] \ln\frac{1}{1 - h(\mathbf{x}_n)}.$$

(b) For the case $h(\mathbf{x}) = \theta(\mathbf{w}^{\mathsf{T}}\mathbf{x})$, argue that minimizing the in-sample
error in part (a) is equivalent to minimizing the one in (3.9).

For two probability distributions $\{p, 1 - p\}$ and $\{q, 1 - q\}$ with binary
outcomes, the cross-entropy (from information theory) is

$$p \log\frac{1}{q} + (1 - p) \log\frac{1}{1 - q}.$$

The in-sample error in part (a) corresponds to a cross-entropy error measure
on the data point $(\mathbf{x}_n, y_n)$, with $p = [\![ y_n = +1 ]\!]$ and $q = h(\mathbf{x}_n)$.


For linear classification, we saw that minimizing $E_{\text{in}}$ for the perceptron is a
combinatorial optimization problem; to solve it, we introduced a number of
algorithms such as the perceptron learning algorithm and the pocket algorithm.
For linear regression, we saw that training can be done using the analytic
pseudo-inverse algorithm for minimizing $E_{\text{in}}$ by setting $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$. These
algorithms were developed based on the specific form of linear classification or
linear regression, so none of them would apply to logistic regression.

To train logistic regression, we will take an approach similar to linear
regression in that we will try to set $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$. Unfortunately, unlike the case
of linear regression, the mathematical form of the gradient of $E_{\text{in}}$ for logistic
regression is not easy to manipulate, so an analytic solution is not feasible.


Exercise 3.7 


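For logistic regression, show that

$$\nabla E_{\text{in}}(\mathbf{w}) = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n \mathbf{x}_n}{1 + e^{y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n}} = \frac{1}{N} \sum_{n=1}^{N} -y_n \mathbf{x}_n\, \theta(-y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n).$$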


Argue that a ‘misclassified’ example contributes more to the gradient than 
a correctly classified one. 


Instead of analytically setting the gradient to zero, we will iteratively set it to
zero. To do so, we will introduce a new algorithm, gradient descent. Gradient
descent is a very general algorithm that can be used to train many other
learning models with smooth error measures. For logistic regression, gradient
descent has particularly nice properties.


3.3.2 Gradient Descent 


Gradient descent is a general technique for minimizing a twice-differentiable
function, such as $E_{\text{in}}(\mathbf{w})$ in logistic regression. A useful physical analogy of
gradient descent is a ball rolling down a hilly surface. If the ball is placed on
a hill, it will roll down, coming to rest at the bottom of a valley. The same
basic idea underlies gradient descent. $E_{\text{in}}(\mathbf{w})$ is a ‘surface’ in a
high-dimensional space. At step 0, we start somewhere on this surface, at $\mathbf{w}(0)$,
and try to roll down this surface, thereby decreasing $E_{\text{in}}$. One thing which you
immediately notice from the physical analogy is that the ball will not necessarily
come to rest in the lowest valley of the entire surface. Depending on where
you start the ball rolling, you will end up at the bottom of one of the valleys,
a local minimum. In general, the same applies to gradient descent. Depending
on your starting weights, the path of descent will take you to a local minimum
in the error surface.

A particular advantage for logistic regression with the cross-entropy error is
that the picture looks much nicer. There is only one valley! So, it does not
matter where you start your ball rolling, it will always roll down to the same
(unique) global minimum. This is a consequence of the fact that $E_{\text{in}}(\mathbf{w})$ is a
convex function of $\mathbf{w}$,⁴ a mathematical property that implies a single ‘valley’
when $E_{\text{in}}$ is plotted against the weights $\mathbf{w}$. This means that gradient descent
will not be trapped in local minima when minimizing such convex error measures.

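Let’s now determine how to ‘roll’ down the $E_{\text{in}}$-surface. We would like to
take a step in the direction of steepest descent, to gain the biggest bang for
our buck. Suppose that we take a small step of size $\eta$ in the direction of a unit
vector $\hat{\mathbf{v}}$. The new weights are $\mathbf{w}(0) + \eta\hat{\mathbf{v}}$. Since $\eta$ is small, using the Taylor
expansion to first order, we compute the change in $E_{\text{in}}$ as

$$\begin{aligned} \Delta E_{\text{in}} &= E_{\text{in}}(\mathbf{w}(0) + \eta\hat{\mathbf{v}}) - E_{\text{in}}(\mathbf{w}(0)) \\ &= \eta\, \nabla E_{\text{in}}(\mathbf{w}(0))^{\mathsf{T}} \hat{\mathbf{v}} + O(\eta^2) \\ &\geq -\eta\, \| \nabla E_{\text{in}}(\mathbf{w}(0)) \|, \end{aligned}$$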





4 In fact, the squared in-sample error in linear regression is also convex, which is why the
analytic solution found by the pseudo-inverse is guaranteed to have optimal in-sample error.

where we have ignored the small term $O(\eta^2)$. Since $\hat{\mathbf{v}}$ is a unit vector, equality
holds if and only if

$$\hat{\mathbf{v}} = -\frac{\nabla E_{\text{in}}(\mathbf{w}(0))}{\| \nabla E_{\text{in}}(\mathbf{w}(0)) \|}. \qquad (3.10)$$

This direction, specified by $\hat{\mathbf{v}}$, leads to the largest decrease in $E_{\text{in}}$ for a given
step size $\eta$.
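The inequality in the last line of the expansion is an instance of the Cauchy-Schwarz inequality. Spelled out for a unit vector $\hat{\mathbf{v}}$, as a short worked step rather than anything beyond what the text states:

```latex
\nabla E_{\text{in}}(\mathbf{w}(0))^{\mathsf{T}} \hat{\mathbf{v}}
  \;\geq\; -\,\| \nabla E_{\text{in}}(\mathbf{w}(0)) \| \, \| \hat{\mathbf{v}} \|
  \;=\; -\,\| \nabla E_{\text{in}}(\mathbf{w}(0)) \|,
```

with equality exactly when $\hat{\mathbf{v}}$ points opposite to the gradient, which is the direction in (3.10). Multiplying through by $\eta > 0$ gives the bound on $\Delta E_{\text{in}}$.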


Exercise 3.8 


The claim that $\hat{\mathbf{v}}$ is the direction which gives largest decrease in $E_{\text{in}}$ only
holds for small $\eta$. Why?


There is nothing to prevent us from continuing to take steps of size $\eta$,
re-evaluating the direction $\hat{\mathbf{v}}_t$ at each iteration $t = 0, 1, 2, \ldots$. How large a step
should one take at each iteration? This is a good question, and to gain some
insight, let’s look at the following examples.




















[Figure: three plots of the in-sample error $E_{\text{in}}$ against the weights $\mathbf{w}$, titled
‘$\eta$ too small’, ‘$\eta$ too large’, and ‘variable $\eta$, just right’.]


A fixed step size (if it is too small) is inefficient when you are far from the
local minimum. On the other hand, too large a step size when you are close to
the minimum leads to bouncing around, possibly even increasing $E_{\text{in}}$. Ideally,
we would like to take large steps when far from the minimum to get in the
right ballpark quickly, and then small (more careful) steps when close to the
minimum. A simple heuristic can accomplish this: far from the minimum,
the norm of the gradient is typically large, and close to the minimum, it is
small. Thus, we could set $\eta_t = \eta\, \| \nabla E_{\text{in}} \|$ to obtain the desired behavior for
the variable step size; choosing the step size proportional to the norm of the
gradient will also conveniently cancel the term normalizing the unit vector $\hat{\mathbf{v}}$ in
Equation (3.10), leading to the fixed learning rate gradient descent algorithm
for minimizing $E_{\text{in}}$ (with redefined $\eta$):

Fixed learning rate gradient descent:
1: Initialize the weights at time step $t = 0$ to $\mathbf{w}(0)$.
2: for $t = 0, 1, 2, \ldots$ do
3:   Compute the gradient $\mathbf{g}_t = \nabla E_{\text{in}}(\mathbf{w}(t))$.
4:   Set the direction to move, $\mathbf{v}_t = -\mathbf{g}_t$.
5:   Update the weights: $\mathbf{w}(t + 1) = \mathbf{w}(t) + \eta\mathbf{v}_t$.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights.








In the algorithm, $\mathbf{v}_t$ is a direction that is no longer restricted to unit length.
The parameter $\eta$ (the learning rate) has to be specified. A typically good
choice for $\eta$ is around 0.1 (a purely practical observation). To use gradient
descent, one must compute the gradient. This can be done explicitly for logistic
regression (see Exercise 3.7).
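The fixed learning rate update can be sketched in a few lines of Python. This is an illustrative sketch, not the book's code: the function name `gradient_descent`, the fixed iteration count used as the stopping rule, and the quadratic test surface are all assumptions.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, max_iters=1000):
    """Fixed learning rate gradient descent: w(t+1) = w(t) + eta * v_t, v_t = -g_t."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)        # gradient at the current weights
        v = -g             # direction to move (not normalized to unit length)
        w = w + eta * v    # weight update
    return w

# Hypothetical convex surface E(w) = (w_0 - 1)^2 + 2 (w_1 + 3)^2, minimized at (1, -3).
def grad_bowl(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 3.0)])

w_min = gradient_descent(grad_bowl, w0=[0.0, 0.0], eta=0.1)
```

Because the step is $\eta$ times the unnormalized gradient, the moves shrink automatically as the gradient shrinks near the minimum, which is the variable step size behavior described above.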


Example 3.3. Gradient descent is a general algorithm for minimizing
twice-differentiable functions. We can apply it to the logistic regression in-sample
error to return weights that approximately minimize

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left( 1 + e^{-y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n} \right).$$








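Logistic regression algorithm:
1: Initialize the weights at time step $t = 0$ to $\mathbf{w}(0)$.
2: for $t = 0, 1, 2, \ldots$ do
3:   Compute the gradient
     $$\mathbf{g}_t = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n \mathbf{x}_n}{1 + e^{y_n \mathbf{w}^{\mathsf{T}}(t)\mathbf{x}_n}}.$$
4:   Set the direction to move, $\mathbf{v}_t = -\mathbf{g}_t$.
5:   Update the weights: $\mathbf{w}(t + 1) = \mathbf{w}(t) + \eta\mathbf{v}_t$.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights $\mathbf{w}$.

□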


Initialization and termination. We have two more loose ends to tie: the 
first is how to choose w(0), the initial weights, and the second is how to 
set the criterion for “...until it is time to stop” in step 6 of the gradient 
descent algorithm. In some cases, such as logistic regression, initializing the 
weights w(0) as zeros works well. However, in general, it is safer to initialize 
the weights randomly, so as to avoid getting stuck on a perfectly symmetric 
hilltop. Choosing each weight independently from a Normal distribution with
zero mean and small variance usually works well in practice.

That takes care of initialization, so we now move on to termination. How
do we decide when to stop? Termination is a non-trivial topic in optimization.
One simple approach, as we encountered in the pocket algorithm, is to set an
upper bound on the number of iterations, where the upper bound is typically
in the thousands, depending on the amount of training time we have. The
problem with this approach is that there is no guarantee on the quality of the
final weights.

Another plausible approach is based on the gradient being zero at any
minimum. A natural termination criterion would be to stop once $\|\mathbf{g}_t\|$ drops below
a certain threshold. Eventually this must happen, but we do not know when
it will happen. For logistic regression, a combination of the two conditions
(setting a large upper bound for the number of iterations, and a small lower
bound for the size of the gradient) usually works well in practice.

There is a problem with relying solely on the size of the gradient to stop,
which is that you might stop prematurely. When the iteration reaches a
relatively flat region of the error surface (which is more common than you
might suspect), the algorithm will prematurely stop when we may want to
continue. So one solution is to require that termination occurs only
if the error change is small and the error itself is small. Ultimately a
combination of termination criteria (a maximum number of iterations, marginal error
improvement, coupled with small value for the error itself) works reasonably
well.
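Putting the pieces together, the following sketch trains logistic regression by gradient descent with zero initialization and two of the stopping rules discussed above (an iteration cap and a small-gradient threshold). The function names, the thresholds `max_iters` and `grad_tol`, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy(w, X, y):
    """E_in(w) = mean of ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w, X, y):
    """Gradient of E_in: -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n))."""
    # 1 / (1 + e^s) is evaluated as exp(-ln(1 + e^s)) to avoid overflow
    scale = np.exp(-np.logaddexp(0.0, y * (X @ w)))
    return -np.mean((y * scale)[:, None] * X, axis=0)

def train(X, y, eta=0.1, max_iters=10000, grad_tol=1e-4):
    w = np.zeros(X.shape[1])               # zero initialization
    for _ in range(max_iters):             # upper bound on the number of iterations
        g = gradient(w, X, y)
        if np.linalg.norm(g) < grad_tol:   # lower bound on the size of the gradient
            break
        w = w - eta * g                    # w(t+1) = w(t) + eta * v_t with v_t = -g_t
    return w

# Synthetic, hypothetical data drawn from a logistic target.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
w_target = np.array([0.5, 2.0, -1.0])
p_plus = 1.0 / (1.0 + np.exp(-(X @ w_target)))
y = np.where(rng.random(N) < p_plus, 1.0, -1.0)

w_fit = train(X, y)
```

A check on the error value itself, or on its change between iterations, could be added to the loop in the same way as the gradient test.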







Example 3.4. By way of summarizing linear models, we revisit our old friend
the credit example. If the goal is to decide whether to approve or deny, then
we are in the realm of classification; if you want to assign an amount of credit
line, then linear regression is appropriate; if you want to predict the probability
that someone will default, use logistic regression.

Credit Analysis

  Goal                     Model                 Error measure          Algorithm
  Approve or deny          Perceptron            Classification error   PLA, Pocket, ...
  Amount of credit         Linear regression     Squared error          Pseudo-inverse
  Probability of default   Logistic regression   Cross-entropy error    Gradient descent

The three linear models have their respective goals, error measures, and
algorithms. Nonetheless, they not only share similar sets of linear hypotheses,
but are in fact related in other ways. We would like to point out one
important relationship: Both logistic regression and linear regression can be used in
linear classification. Here is how.

Logistic regression produces a final hypothesis $g(\mathbf{x})$ which is our estimate
of $\mathbb{P}[y = +1 \mid \mathbf{x}]$. Such an estimate can easily be used for classification by
setting a threshold on $g(\mathbf{x})$; a natural threshold is $\frac{1}{2}$, which corresponds to
classifying $+1$ if $+1$ is more likely. This choice for threshold corresponds to
using the logistic regression weights as weights in the perceptron for
classification. Not only can logistic regression weights be used for classification in this
way, but they can also be used as a way to train the perceptron model. The
perceptron learning problem (3.2) is a very hard combinatorial optimization
problem. The convexity of $E_{\text{in}}$ in logistic regression makes the optimization
problem much easier to solve. Since the logistic function is a soft version of
a hard threshold, the logistic regression weights should be good weights for
classification using the perceptron.

A similar relationship exists between classification and linear regression.
Linear regression can be used with any real-valued target function, which
includes real values that are $\pm 1$. If $\mathbf{w}_{\text{lin}}^{\mathsf{T}}\mathbf{x}$ is fit to $\pm 1$ values, $\operatorname{sign}(\mathbf{w}_{\text{lin}}^{\mathsf{T}}\mathbf{x})$ will
likely agree with these values and make good classification predictions. In
other words, the linear regression weights $\mathbf{w}_{\text{lin}}$, which are easily computed
using the pseudo-inverse, are also an approximate solution for the perceptron
model. The weights can be directly used for classification, or used as an initial
condition for the pocket algorithm to give it a head start. □
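A minimal sketch of linear regression used for classification follows. The synthetic data set and the variable names are assumptions for illustration; the pseudo-inverse is computed with NumPy's `np.linalg.pinv`.

```python
import numpy as np

# Hypothetical data: points in [-1, 1]^2 labeled by a linear target.
rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, 2))])
w_target = np.array([0.1, 1.0, -1.0])
y = np.sign(X @ w_target)                 # labels are +1 or -1

# Linear regression on the +/-1 labels via the pseudo-inverse.
w_lin = np.linalg.pinv(X) @ y

# Use the regression weights directly as a classifier.
predictions = np.sign(X @ w_lin)
classification_error = np.mean(predictions != y)
squared_error = np.mean((X @ w_lin - y) ** 2)
```

Here `w_lin` could also be handed to the pocket algorithm as its starting point instead of being used as the final classifier.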


Exercise 3.9 


Consider pointwise error measures $e_{\text{class}}(s, y) = [\![ y \neq \operatorname{sign}(s) ]\!]$, $e_{\text{sq}}(s, y) =
(y - s)^2$, and $e_{\log}(s, y) = \ln(1 + \exp(-ys))$, where the signal $s = \mathbf{w}^{\mathsf{T}}\mathbf{x}$.

(a) For $y = +1$, plot $e_{\text{class}}$, $e_{\text{sq}}$ and $\frac{1}{\ln 2} e_{\log}$ versus $s$, on the same plot.

(b) Show that $e_{\text{class}}(s, y) \leq e_{\text{sq}}(s, y)$, and hence that the classification
error is upper bounded by the squared error.

(c) Show that $e_{\text{class}}(s, y) \leq \frac{1}{\ln 2} e_{\log}(s, y)$, and, as in part (b), get an
upper bound (up to a constant factor) using the logistic regression
error.


These bounds indicate that minimizing the squared or logistic regression 
error should also decrease the classification error, which justifies using the 
weights returned by linear or logistic regression as approximations for clas- 
sification. 


Stochastic gradient descent. The version of gradient descent we have
described so far is known as batch gradient descent: the gradient is computed
for the error on the whole data set before a weight update is done. A
sequential version of gradient descent known as stochastic gradient descent (SGD)
turns out to be very efficient in practice. Instead of considering the full batch
gradient on all $N$ training data points, we consider a stochastic version of
the gradient. First, pick a training data point $(\mathbf{x}_n, y_n)$ uniformly at random
(hence the name ‘stochastic’), and consider only the error on that data point
(in the case of logistic regression),

$$e_n(\mathbf{w}) = \ln\left( 1 + e^{-y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n} \right).$$

The gradient of this single data point’s error is used for the weight update in
exactly the same way that the gradient was used in batch gradient descent.
The gradient needed for the weight update of SGD is (see Exercise 3.7)

$$\nabla e_n(\mathbf{w}) = \frac{-y_n \mathbf{x}_n}{1 + e^{y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n}},$$

and the weight update is $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla e_n(\mathbf{w})$. Insight into why SGD works can
be gained by looking at the expected value of the change in the weight (the
expectation is with respect to the random point that is selected). Since $n$ is
picked uniformly at random from $\{1, \ldots, N\}$, the expected weight change is

$$-\eta\, \frac{1}{N} \sum_{n=1}^{N} \nabla e_n(\mathbf{w}).$$


This is exactly the same as the deterministic weight change from the batch 
gradient descent weight update. That is, ‘on average’ the minimization pro- 
ceeds in the right direction, but is a bit wiggly. In the long run, these random 
fluctuations cancel out. The computational cost is cheaper by a factor of N, 
though, since we compute the gradient for only one point per iteration, rather 
than for all N points as we do in batch gradient descent. 

Notice that SGD is similar to PLA in that it decreases the error with re- 
spect to one data point at a time. Minimizing the error on one data point may 
interfere with the error on the rest of the data points that are not considered 
at that iteration. However, also similar to PLA, the interference cancels out 
on average as we have just argued. 
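The SGD update for logistic regression can be sketched as follows. The function names, the fixed number of steps used in place of a stopping rule, and the synthetic data are assumptions for illustration. The last lines check numerically that averaging the single-point gradients over all $n$ gives the batch gradient.

```python
import numpy as np

def point_gradient(w, x_n, y_n):
    """Gradient of e_n(w) = ln(1 + exp(-y_n w^T x_n)) for one data point."""
    s = y_n * (x_n @ w)
    return -y_n * x_n * np.exp(-np.logaddexp(0.0, s))   # -y_n x_n / (1 + e^s)

def sgd(X, y, eta=0.1, num_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        n = rng.integers(len(y))                 # pick n uniformly from the N points
        w = w - eta * point_gradient(w, X[n], y[n])
    return w

def in_sample_error(w, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

# Synthetic, hypothetical data drawn from a logistic target.
data_rng = np.random.default_rng(1)
N = 300
X = np.column_stack([np.ones(N), data_rng.normal(size=(N, 2))])
p_plus = 1.0 / (1.0 + np.exp(-(X @ np.array([0.0, 2.0, -2.0]))))
y = np.where(data_rng.random(N) < p_plus, 1.0, -1.0)

w_sgd = sgd(X, y)

# The average of the single-point gradients equals the batch gradient.
w_probe = np.array([0.3, -0.2, 0.1])
mean_point_grad = np.mean([point_gradient(w_probe, X[n], y[n]) for n in range(N)], axis=0)
batch_grad = -np.mean((y / (1.0 + np.exp(y * (X @ w_probe))))[:, None] * X, axis=0)
```

Each step touches a single row of `X`, which is where the factor-of-$N$ saving per iteration comes from.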


Exercise 3.10 


(a) Define an error for a single data point $(\mathbf{x}_n, y_n)$ to be

$$e_n(\mathbf{w}) = \max(0, -y_n \mathbf{w}^{\mathsf{T}}\mathbf{x}_n).$$

Argue that PLA can be viewed as SGD on $e_n$ with learning rate $\eta = 1$.

(b) For logistic regression with a very large w, argue that minimizing Ein 
using SGD is similar to PLA. This is another indication that the lo- 
gistic regression weights can be used as a good approximation for 
classification. 


SGD is successful in practice, often beating the batch version and other more
sophisticated algorithms. In fact, SGD was an important part of the algorithm
that won the million-dollar Netflix competition, discussed in Section 1.1. It
scales well to large data sets, and is naturally suited to online learning, where
a stream of data present themselves to the learning algorithm sequentially.
The randomness introduced by processing one data point at a time can be a
plus, helping the algorithm to avoid flat regions and local minima in the case
of a complicated error surface. However, it is challenging to choose a
suitable termination criterion for SGD. A good stopping criterion should consider
the total error on all the data, which can be computationally demanding to
evaluate at each iteration.


3.4 Nonlinear Transformation 


All formulas for the linear model have used the sum

$$\mathbf{w}^{\mathsf{T}}\mathbf{x} = \sum_{i=0}^{d} w_i x_i \qquad (3.11)$$

as the main quantity in computing the hypothesis output. This quantity is
linear, not only in the $x_i$’s but also in the $w_i$’s. A closer inspection of the
corresponding learning algorithms shows that the linearity in $w_i$’s is the key
property for deriving these algorithms; the $x_i$’s are just constants as far as the
algorithm is concerned. This observation opens the possibility for allowing
nonlinear versions of $x_i$’s while still remaining in the analytic realm of linear
models, because the form of Equation (3.11) remains linear in the $w_i$
parameters.

Consider the credit limit problem for instance. It makes sense that the
‘years in residence’ field would affect a person’s credit since it is correlated with
stability. However, it is less plausible that the credit limit would grow linearly
with the number of years in residence. More plausibly, there is a threshold
(say 1 year) below which the credit limit is affected negatively and another
threshold (say 5 years) above which the credit limit is affected positively. If $x_i$
is the input variable that measures years in residence, then two nonlinear
‘features’ derived from it, namely $[\![ x_i < 1 ]\!]$ and $[\![ x_i > 5 ]\!]$, would allow a linear
formula to reflect the credit limit better.
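The two indicator features can be sketched as below. The function name, the sample inputs, and in particular the weight values are hypothetical and serve only to show that the formula stays linear in the weights.

```python
import numpy as np

def residence_features(years):
    """Map years in residence to (1, [[x < 1]], [[x > 5]])."""
    years = np.asarray(years, dtype=float)
    return np.column_stack([
        np.ones_like(years),          # constant coordinate
        (years < 1).astype(float),    # indicator of less than 1 year
        (years > 5).astype(float),    # indicator of more than 5 years
    ])

Z = residence_features([0.5, 3.0, 10.0])

# Hypothetical weights: a base amount, a penalty below 1 year, a bonus above 5 years.
w_hypothetical = np.array([10.0, -4.0, 6.0])
credit = Z @ w_hypothetical   # linear in the weights, nonlinear in years
```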

We have already seen the use of features in the classification of handwritten 
digits, where intensity and symmetry features were derived from input pixels. 
Nonlinear transforms can be further applied to those features, as we will see 
shortly, creating more elaborate features and improving the performance. The 
scope of linear methods expands significantly when we represent the input by 
a set of appropriate features. 


3.4.1 The Z Space 


Consider the situation in Figure 3.1(b) where a linear classifier can’t fit the
data. By transforming the inputs $x_1, x_2$ in a nonlinear fashion, we will be able
to separate the data with more complicated boundaries while still using the
simple PLA as a building block. Let’s start by looking at the circle in
Figure 3.5(a), which is a replica of the non-separable case in Figure 3.1(b). The
circle represents the following equation:

$$x_1^2 + x_2^2 = 0.6.$$

That is, the nonlinear hypothesis $h(\mathbf{x}) = \operatorname{sign}(-0.6 + x_1^2 + x_2^2)$ separates the
data set perfectly. We can view the hypothesis as a linear one after applying
a nonlinear transformation on $\mathbf{x}$. In particular, consider $z_0 = 1$, $z_1 = x_1^2$ and
$z_2 = x_2^2$,

$$\begin{aligned} h(\mathbf{x}) &= \operatorname{sign}\Big( \underbrace{(-0.6)}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{1}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{1}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2} \Big) \\ &= \operatorname{sign}\Big( \underbrace{[\tilde{w}_0 \ \ \tilde{w}_1 \ \ \tilde{w}_2]}_{\tilde{\mathbf{w}}^{\mathsf{T}}} \underbrace{\begin{bmatrix} z_0 \\ z_1 \\ z_2 \end{bmatrix}}_{\mathbf{z}} \Big), \end{aligned}$$

where the vector $\mathbf{z}$ is obtained from $\mathbf{x}$ through a nonlinear transform $\Phi$,

$$\mathbf{z} = \Phi(\mathbf{x}).$$


We can plot the data in terms of $\mathbf{z}$ instead of $\mathbf{x}$, as depicted in Figure 3.5(b).
For instance, the point $\mathbf{x}_1$ in Figure 3.5(a) is transformed to the point $\mathbf{z}_1$ in
Figure 3.5(b) and the point $\mathbf{x}_2$ is transformed to the point $\mathbf{z}_2$. The space $\mathcal{Z}$,
which contains the $\mathbf{z}$ vectors, is referred to as the feature space since its
coordinates are higher-level features derived from the raw input $\mathbf{x}$. We designate
different quantities in $\mathcal{Z}$ with a tilde version of their counterparts in $\mathcal{X}$, e.g.,
the dimensionality of $\mathcal{Z}$ is $\tilde{d}$ and the weight vector is $\tilde{\mathbf{w}}$.⁵ The transform $\Phi$
that takes us from $\mathcal{X}$ to $\mathcal{Z}$ is called a feature transform, which in this case is

$$\Phi(\mathbf{x}) = (1, x_1^2, x_2^2). \qquad (3.12)$$

In general, some points in the $\mathcal{Z}$ space may not be valid transforms of any
$\mathbf{x} \in \mathcal{X}$, and multiple points in $\mathcal{X}$ may be transformed to the same $\mathbf{z} \in \mathcal{Z}$,
depending on the nonlinear transform $\Phi$.

The usefulness of the transform above is that the nonlinear hypothesis $h$
(circle) in the $\mathcal{X}$ space can be represented by a linear hypothesis (line) in
the $\mathcal{Z}$ space. Indeed, any linear hypothesis $\tilde{h}$ in $\mathbf{z}$ corresponds to a (possibly
nonlinear) hypothesis of $\mathbf{x}$ given by

$$h(\mathbf{x}) = \tilde{h}(\Phi(\mathbf{x})).$$


5 $\mathcal{Z} = \{1\} \times \mathbb{R}^{\tilde{d}}$, where $\tilde{d} = 2$ in this case. We treat $\mathcal{Z}$ as $\tilde{d}$-dimensional since the added
coordinate $z_0 = 1$ is fixed.

[Figure 3.5 panels: (a) the data in $\mathcal{X}$-space; (b) Transformed data in $\mathcal{Z}$-space,
$\mathbf{z} = \Phi(\mathbf{x}) = (1, x_1^2, x_2^2)$.]

Figure 3.5: (a) The original data set that is not linearly separable, but
separable by a circle. (b) The transformed data set that is linearly separable
in the $\mathcal{Z}$ space. In the figure, $\mathbf{x}_1$ maps to $\mathbf{z}_1$ and $\mathbf{x}_2$ maps to $\mathbf{z}_2$; the circular
separator in the $\mathcal{X}$-space maps to the linear separator in the $\mathcal{Z}$-space.





The set of these hypotheses $h$ is denoted by $\mathcal{H}_\Phi$. For instance, when using
the feature transform in (3.12), each $h \in \mathcal{H}_\Phi$ is a quadratic curve in $\mathcal{X}$ that
corresponds to some line $\tilde{h}$ in $\mathcal{Z}$.


Exercise 3.11 


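Consider the feature transform $\Phi$ in (3.12). What kind of boundary in $\mathcal{X}$
does a hyperplane $\tilde{\mathbf{w}}$ in $\mathcal{Z}$ correspond to in the following cases? Draw a
picture that illustrates an example of each case.

(a) $\tilde{w}_1 > 0$, $\tilde{w}_2 < 0$
(b) $\tilde{w}_1 > 0$, $\tilde{w}_2 = 0$
(c) $\tilde{w}_1 > 0$, $\tilde{w}_2 > 0$, $\tilde{w}_0 < 0$
(d) $\tilde{w}_1 > 0$, $\tilde{w}_2 > 0$, $\tilde{w}_0 > 0$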


Because the transformed data set $(\mathbf{z}_1, y_1), \ldots, (\mathbf{z}_N, y_N)$ in Figure 3.5(b) is
linearly separable in the feature space $\mathcal{Z}$, we can apply PLA on the transformed
data set to obtain $\tilde{\mathbf{w}}_{\text{PLA}}$, the PLA solution, which gives us a final hypothesis
$g(\mathbf{x}) = \operatorname{sign}(\tilde{\mathbf{w}}_{\text{PLA}}^{\mathsf{T}}\mathbf{z})$ in the $\mathcal{X}$ space, where $\mathbf{z} = \Phi(\mathbf{x})$. The whole process of
applying the feature transform before running PLA for linear classification is
depicted in Figure 3.6.
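The process of transforming the data and then running PLA can be sketched as follows. The synthetic data set (with a small gap left around the circle so that PLA converges quickly), the function names, and the update cap are assumptions for illustration.

```python
import numpy as np

def phi(X):
    """Feature transform (3.12): x = (x1, x2) -> z = (1, x1^2, x2^2)."""
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

def pla(Z, y, max_updates=100000):
    """Perceptron learning algorithm on (z_n, y_n); stops when nothing is misclassified."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(Z @ w) != y)
        if wrong.size == 0:
            break
        n = wrong[0]               # pick a misclassified point
        w = w + y[n] * Z[n]        # PLA update in the Z space
    return w

# Hypothetical data in [-1, 1]^2, labeled by the circle x1^2 + x2^2 = 0.6.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
r2 = (X ** 2).sum(axis=1)
keep = np.abs(r2 - 0.6) > 0.1      # leave a gap around the circle
X, r2 = X[keep], r2[keep]
y = np.sign(r2 - 0.6)

w_pla = pla(phi(X), y)             # linear separation in Z

def g(X_new):
    """Final hypothesis in X space: g(x) = sign(w_pla^T Phi(x))."""
    return np.sign(phi(X_new) @ w_pla)

e_in = np.mean(g(X) != y)
```

The weights returned by PLA need not equal $(-0.6, 1, 1)$; any separating line in $\mathcal{Z}$ yields zero in-sample error.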

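The in-sample error in the input space $\mathcal{X}$ is the same as in the feature
space $\mathcal{Z}$, so $E_{\text{in}}(g) = 0$. Hyperplanes that achieve $E_{\text{in}}(\tilde{\mathbf{w}}_{\text{PLA}}) = 0$ in $\mathcal{Z}$
correspond to separating curves in the original input space $\mathcal{X}$. For instance,
as shown in Figure 3.6, the PLA may select the line $\tilde{\mathbf{w}}_{\text{PLA}} = (-0.6, 0.6, 1)$
that separates the transformed data $(\mathbf{z}_1, y_1), \ldots, (\mathbf{z}_N, y_N)$. The corresponding
hypothesis $g(\mathbf{x}) = \operatorname{sign}(-0.6 + 0.6 \cdot x_1^2 + x_2^2)$ will separate the original data
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$. In this case, the decision boundary is an ellipse in $\mathcal{X}$.

[Figure 3.6 panels:
1. Original data, $\mathbf{x}_n \in \mathcal{X}$
2. Transform the data, $\mathbf{z}_n = \Phi(\mathbf{x}_n) \in \mathcal{Z}$
3. Separate data in $\mathcal{Z}$-space, $\tilde{g}(\mathbf{z}) = \operatorname{sign}(\tilde{\mathbf{w}}^{\mathsf{T}}\mathbf{z})$
4. Classify in $\mathcal{X}$-space, $g(\mathbf{x}) = \tilde{g}(\Phi(\mathbf{x})) = \operatorname{sign}(\tilde{\mathbf{w}}^{\mathsf{T}}\Phi(\mathbf{x}))$]

Figure 3.6: The nonlinear transform for separating non-separable data.
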
How does the feature transform affect the VC bound (3.1)? If we honestly
decide on the transform $\Phi$ before seeing the data, then with probability at
least $1 - \delta$, the bound (3.1) remains true by using $d_{\text{vc}}(\mathcal{H}_\Phi)$ as the VC
dimension. For instance, consider the feature transform $\Phi$ in (3.12). We know that
$\mathcal{Z} = \{1\} \times \mathbb{R}^2$. Since $\mathcal{H}_\Phi$ is the perceptron in $\mathcal{Z}$, $d_{\text{vc}}(\mathcal{H}_\Phi) \leq 3$ (the $\leq$ is because
some points $\mathbf{z} \in \mathcal{Z}$ may not be valid transforms of any $\mathbf{x}$, so some dichotomies
may not be realizable). We can then substitute $N$, $d_{\text{vc}}(\mathcal{H}_\Phi)$, and $\delta$ into the
VC bound. After running PLA on the transformed data set, if we succeed in
getting some $g$ with $E_{\text{in}}(g) = 0$, we can claim that $g$ will perform well out of
sample.

It is very important to understand that the claim above is valid only if you
decide on $\Phi$ before seeing the data or trying any algorithms. What if we first
try using lines to separate the data, fail, and then use the circles? Then we
are effectively using a model that contains both lines and circles, and $d_{\text{vc}}$ is
no longer 3.


Exercise 3.12 

We know that in the Euclidean plane, the perceptron model $\mathcal{H}$ cannot
implement all 16 dichotomies on 4 points. That is, $m_{\mathcal{H}}(4) < 16$. Take the
feature transform $\Phi$ in (3.12).

(a) Show that $m_{\mathcal{H}_\Phi}(3) = 8$.
(b) Show that $m_{\mathcal{H}_\Phi}(4) < 16$.
(c) Show that $m_{\mathcal{H} \cup \mathcal{H}_\Phi}(4) = 16$.

That is, if you used lines, $d_{\text{vc}} = 3$; if you used ellipses, $d_{\text{vc}} = 3$; if you used
lines and ellipses, $d_{\text{vc}} > 3$.


Worse yet, if you actually look at the data (e.g., look at the points in
Figure 3.1(a)) before deciding on a suitable $\Phi$, you forfeit most of what you
learned in Chapter 2 ☺. You have inadvertently explored a huge hypothesis
space in your mind to come up with a specific $\Phi$ that would work for this data
set. If you invoke a generalization bound now, you will be charged for the VC
dimension of the full space that you explored in your mind, not just the space
that $\Phi$ creates.

This does not mean that $\Phi$ should be chosen blindly. In the credit limit
problem for instance, we suggested nonlinear features based on the ‘years in 
residence’ field that may be more suitable for linear regression than the raw 
input. This was based on our understanding of the problem, not on ‘snooping’ 
into the training data. Therefore, we pay no price in terms of generalization, 
and we may well gain a dividend in performance because of a good choice of 
features. 

The feature transform $\Phi$ can be general, as long as it is chosen before seeing
the data set (as if we cannot emphasize this enough). For instance, you may
have noticed that the feature transform in (3.12) only allows us to get very
limited types of quadratic curves. Ellipses that do not center at the origin
in $\mathcal{X}$ cannot correspond to a hyperplane in $\mathcal{Z}$. To get all possible quadratic
curves in $\mathcal{X}$, we could consider the more general feature transform $\mathbf{z} = \Phi_2(\mathbf{x})$,

$$\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2), \qquad (3.13)$$


which gives us the flexibility to represent any quadratic curve in $\mathcal{X}$ by a
hyperplane in $\mathcal{Z}$ (the subscript 2 of $\Phi$ is for polynomials of degree 2, i.e., quadratic
curves). The price we pay is that $\mathcal{Z}$ is now five-dimensional instead of
two-dimensional, and hence $d_{\text{vc}}$ is doubled from 3 to 6.

Exercise 3.13

Consider the feature transform $\mathbf{z} = \Phi_2(\mathbf{x})$ in (3.13). How can we use a
hyperplane $\tilde{\mathbf{w}}$ in $\mathcal{Z}$ to represent the following boundaries in $\mathcal{X}$?

(a) The parabola $(x_1 - 3)^2 + x_2 = 1$.
(b) The circle $(x_1 - 3)^2 + (x_2 - 4)^2 = 1$.
(c) The ellipse $2(x_1 - 3)^2 + (x_2 - 4)^2 = 1$.
(d) The hyperbola $(x_1 - 3)^2 - (x_2 - 4)^2 = 1$.
(e) The ellipse $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1$.
(f) The line $2x_1 + x_2 = 1$.


One may further extend $\Phi_2$ to a feature transform $\Phi_3$ for cubic curves in $\mathcal{X}$,
or more generally define the feature transform $\Phi_Q$ for degree-$Q$ curves in $\mathcal{X}$.
The feature transform $\Phi_Q$ is called the $Q$th order polynomial transform.
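For two-dimensional inputs, the $Q$th order polynomial transform can be sketched as below. The function name and the ordering of the monomials (by total degree, matching (3.13) for $Q = 2$) are choices made for this illustration.

```python
import numpy as np

def poly_transform_2d(X, Q):
    """Q-th order polynomial transform of two-dimensional inputs.

    Columns are ordered by total degree: 1, x1, x2, x1^2, x1 x2, x2^2, ...
    """
    x1, x2 = X[:, 0], X[:, 1]
    columns = []
    for degree in range(Q + 1):
        for j in range(degree + 1):
            # monomial x1^(degree - j) * x2^j
            columns.append(x1 ** (degree - j) * x2 ** j)
    return np.column_stack(columns)

X_demo = np.array([[2.0, 3.0]])
Z2 = poly_transform_2d(X_demo, 2)   # (1, x1, x2, x1^2, x1 x2, x2^2) at x = (2, 3)
Z4 = poly_transform_2d(X_demo, 4)   # fourth-order transform, 15 coordinates
```

Counting the fixed coordinate $z_0 = 1$, the transform has $1 + Q(Q+3)/2$ columns for two-dimensional inputs.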

The power of the feature transform should be used with care. It may 
not be worth it to insist on linear separability and employ a highly complex 
surface to achieve that. Consider the case of Figure 3.1(a). If we insist on a 
feature transform that linearly separates the data, it may lead to a significant 
increase of the VC dimension. As we see in Figure 3.7, no line can separate the 
training examples perfectly, and neither can any quadratic nor any third-order 
polynomial curves. Thus, we need to use a fourth-order polynomial transform:

$$\Phi_4(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, x_1^4, x_1^3 x_2, x_1^2 x_2^2, x_1 x_2^3, x_2^4).$$


If you look at the fourth-order decision boundary in Figure 3.7(b), you don’t 
need the VC analysis to tell you that this is an overkill that is unlikely to 
generalize well to new data. A better option would have been to ignore the two 
misclassified examples in Figure 3.7(a), separate the other examples perfectly 
with the line, and accept the small but nonzero $E_{\text{in}}$. Indeed, sometimes our
best bet is to go with a simpler hypothesis set while tolerating a small $E_{\text{in}}$.

While our discussion of feature transforms has focused on classification
problems, these transforms can be applied equally to regression problems.
Both linear regression and logistic regression can be implemented in the feature
space $\mathcal{Z}$ instead of the input space $\mathcal{X}$. For instance, linear regression is often
coupled with a feature transform to perform nonlinear regression. The $N$
by $d + 1$ input matrix $\mathrm{X}$ in the algorithm is replaced with the $N$ by $\tilde{d} + 1$
matrix $\mathrm{Z}$, while the output vector $\mathbf{y}$ remains the same.
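A minimal sketch of nonlinear regression through a feature transform follows. The one-dimensional input, the quadratic features $(1, x, x^2)$, and the noiseless quadratic target are assumptions chosen so that the result is easy to verify.

```python
import numpy as np

# Hypothetical one-dimensional inputs and a noiseless quadratic target.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 - 2.0 * x + 3.0 * x ** 2

# Replace the input matrix X by the transformed matrix Z with rows (1, x, x^2).
Z = np.column_stack([np.ones_like(x), x, x ** 2])

# Linear regression in the Z space via the pseudo-inverse; y is unchanged.
w_tilde = np.linalg.pinv(Z) @ y

def g(x_new):
    """Final hypothesis in the input space: g(x) = w_tilde^T Phi(x)."""
    x_new = np.asarray(x_new, dtype=float)
    return np.column_stack([np.ones_like(x_new), x_new, x_new ** 2]) @ w_tilde
```

The fit is nonlinear in $x$ but the algorithm is the same pseudo-inverse computation, now applied to $\mathrm{Z}$.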


3.4.2 Computation and Generalization 


Although using a larger $Q$ gives us more flexibility in terms of the shape of
decision boundaries in $\mathcal{X}$, there is a price to be paid. Computation is one
issue, and generalization is the other.

Computation is an issue because the feature transform $\Phi_Q$ maps a
two-dimensional vector $\mathbf{x}$ to $\tilde{d} = \frac{Q(Q+3)}{2}$ dimensions, which increases the memory
and computational costs. Things could get worse if $\mathbf{x}$ is in a higher dimension
to begin with.

[Figure 3.7 panels: (a) Linear fit; (b) 4th order polynomial fit.]

Figure 3.7: Illustration of the nonlinear transform using a data set that
is not linearly separable; (a) a line separates the data after omitting a few
points, (b) a fourth-order polynomial separates all the points.


Exercise 3.14 


Consider the $Q$th order polynomial transform $\Phi_Q$ for $\mathcal{X} = \mathbb{R}^d$. What is
the dimensionality $\tilde{d}$ of the feature space $\mathcal{Z}$ (excluding the fixed coordinate
$z_0 = 1$)? Evaluate your result on $d \in \{2, 3, 5, 10\}$ and $Q \in \{2, 3, 5, 10\}$.


The other important issue is generalization. If $\Phi_Q$ is the feature transform of
a two-dimensional input space, there will be $\tilde{d} = \frac{Q(Q+3)}{2}$ dimensions in $\mathcal{Z}$, and
$d_{\text{vc}}(\mathcal{H}_\Phi)$ can be as high as $\frac{Q(Q+3)}{2} + 1$. This means that the second term in
the VC bound (3.1) can grow significantly. In other words, we would have a
weaker guarantee that $E_{\text{out}}$ will be small. For instance, if we use $\Phi = \Phi_{50}$,
the VC dimension of $\mathcal{H}_\Phi$ could be as high as $\frac{(50)(53)}{2} + 1 = 1326$ instead of the
original $d_{\text{vc}} = 3$. Applying the rule of thumb that the amount of data needed
is proportional to the VC dimension, we would need hundreds of times more
data than we would if we didn’t use a feature transform, in order to achieve
the same level of generalization error.
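The arithmetic in this paragraph can be reproduced with a few lines. The helper names are hypothetical; the formulas are the ones stated above for a two-dimensional input space.

```python
def d_tilde(Q):
    """Dimension of Z for the Q-th order transform of a 2-D input: Q(Q+3)/2."""
    return Q * (Q + 3) // 2

def dvc_upper(Q):
    """Largest possible VC dimension of the perceptron in Z: d_tilde + 1."""
    return d_tilde(Q) + 1

# Compare Phi_50 against the plain two-dimensional perceptron (d_vc = 3).
ratio = dvc_upper(50) / 3
```

Here `dvc_upper(50)` is 1326 and `ratio` is 442, which is the ‘hundreds of times more data’ suggested by the rule of thumb.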


Exercise 3.15


High-dimensional feature transforms are by no means the only transforms
that we can use. We can take the tradeoff in the other direction, and
use low-dimensional feature transforms as well (to achieve an even lower
generalization error bar).

Consider the following feature transform, which maps a $d$-dimensional $\mathbf{x}$ to
a one-dimensional $\mathbf{z}$, keeping only the $k$th coordinate of $\mathbf{x}$.

$$\Phi_{(k)}(\mathbf{x}) = (1, x_k). \qquad (3.14)$$

Let $\mathcal{H}_k$ be the set of perceptrons in the feature space.

(a) Prove that $d_{\text{vc}}(\mathcal{H}_k) = 2$.
(b) Prove that $d_{\text{vc}}(\cup_{k=1}^{d} \mathcal{H}_k) \leq 2(\log_2 d + 1)$.

$\mathcal{H}_k$ is called the decision stump model on dimension $k$.


The problem of generalization when we go to high-dimensional space is sometimes 
balanced by the advantage we get in approximating the target better. As 
we have seen in the case of using quadratic curves instead of lines, the transformed 
data became linearly separable, reducing $E_{in}$ to 0. In general, when 
choosing the appropriate dimension for the feature transform, we cannot avoid 
the approximation-generalization tradeoff, 

  higher $\tilde{d}$    better chance of being linearly separable ($E_{in}$ ↓)    $d_{vc}$ ↑ 
  lower $\tilde{d}$     possibly not linearly separable ($E_{in}$ ↑)              $d_{vc}$ ↓ 





Therefore, choosing a feature transform before seeing the data is a non-trivial 
task. When we apply learning to a particular problem, some understanding 
of the problem can help in choosing features that work well. More generally, 
there are some guidelines for choosing a suitable transform, or a suitable model, 
which we will discuss in Chapter 4. 


Exercise 3.16 
Write down the steps of the algorithm that combines $\Phi_3$ with linear 
regression. How about using $\Phi_{10}$ instead? Where is the main computational 
bottleneck of the resulting algorithm? 


Example 3.5. Let’s revisit the handwritten digit recognition example. We 
can try a different way of decomposing the big task of separating ten digits 
to smaller tasks. One decomposition is to separate digit 1 from all the other 
digits. Using intensity and symmetry as our input variables like we did before, 
the scatter plot of the training data is shown next. A line can roughly separate 
digit 1 from the rest, but a more complicated curve might do better. 
[Figure: scatter plot of the training data, Symmetry versus Average Intensity.] 


We use linear regression (for classification), first without any feature transform. 
The results are shown below (LHS). We get $E_{in} = 2.13\%$ and $E_{out} = 2.38\%$. 

[Figure: two plots of Symmetry versus Average Intensity showing the learned 
boundaries; linear model (LHS) and 3rd order polynomial model (RHS).] 

               Linear model    3rd order polynomial model 
  $E_{in}$        2.13%           1.75% 
  $E_{out}$       2.38%           1.87% 

Classification of the digits data (‘1’ versus ‘not 1’) using linear and 
third order polynomial models. 


When we run linear regression with $\Phi_3$, the third-order polynomial transform, 
we obtain a better fit to the data, with a lower $E_{in} = 1.75\%$. The result is 
depicted in the RHS of the figure. In this case, the better in-sample fit also 
resulted in a better out-of-sample performance, with $E_{out} = 1.87\%$. □ 
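
The procedure in this example (transform the inputs, then run linear regression for classification in the feature space) can be sketched as follows. This is only an illustration under stated assumptions: the digits data is not reproduced here, so a synthetic two-dimensional data set with a circular boundary stands in for it, and the helper names `poly_transform` and `linreg_classify` are ours.

```python
import numpy as np


def poly_transform(X, Q):
    """Map each row (x1, x2) to all monomials x1^a * x2^b with a + b <= Q.

    The first column is the constant coordinate z0 = 1.
    """
    cols = [X[:, 0] ** (k - j) * X[:, 1] ** j
            for k in range(Q + 1) for j in range(k + 1)]
    return np.column_stack(cols)


def linreg_classify(Z, y):
    """Linear regression for classification: pseudo-inverse weights, then sign."""
    w = np.linalg.pinv(Z) @ y
    e_in = float(np.mean(np.sign(Z @ w) != y))  # in-sample classification error
    return w, e_in


# Synthetic stand-in data (an assumption, not the digits data from the text).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1.0, -1.0)

for Q in (1, 3):
    Z = poly_transform(X, Q)
    w, e_in = linreg_classify(Z, y)
    print(f"Q={Q}: {Z.shape[1]} features, E_in={e_in:.3f}")
```

The only change between the linear model and the polynomial model is the call to `poly_transform`; the learning algorithm itself is untouched, which is the point of working in the feature space.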


Linear models, a final pitch. The linear model (for classification or regres- 
sion) is an often overlooked resource in the arena of learning from data. Since 
efficient learning algorithms exist for linear models, they are low overhead. 
They are also very robust and have good generalization properties. A sound 
policy to follow when learning from data is to first try a linear model. Because 
of the good generalization properties of linear models, not much can go wrong. 
If you get a good fit to the data (low Ein), then you are done. If you do not get 
a good enough fit to the data and decide to go for a more complex model, you 
will pay a price in terms of the VC dimension as we have seen in Exercise 3.12, 
but the price is modest. 
3.5 Problems 


Problem 3.1 Consider the double semi-circle “toy” learning task below. 





There are two semi-circles of width thk with inner radius rad, separated by 
sep as shown (red is −1 and blue is +1). The center of the top semi-circle 
is aligned with the middle of the edge of the bottom semi-circle. This task 
is linearly separable when sep > 0, and not so for sep < 0. Set rad = 10, 
thk = 5 and sep = 5. Then, generate 2,000 examples uniformly, which means 
you will have approximately 1,000 examples for each class. 


(a) Run the PLA starting from w = 0 until it converges. Plot the data and 
the final hypothesis. 


(b) Repeat part (a) using the linear regression (for classification) to obtain w. 
Explain your observations. 
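
One possible way to generate the data for this task is sketched below. The geometry follows the description above, but two details are our assumptions because they are only visible in the figure: the top semi-circle is centered at the origin, and the top semi-circle is taken to be the −1 (red) class. The function name `make_semicircles` is hypothetical.

```python
import math
import random


def make_semicircles(n, rad=10.0, thk=5.0, sep=5.0, seed=0):
    """Sample n points uniformly from the double semi-circle region.

    Returns a list of (x1, x2) points and a list of +1/-1 labels.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        # Uniform over the half-annulus area: r^2 is uniform, angle is uniform.
        r = math.sqrt(rng.uniform(rad ** 2, (rad + thk) ** 2))
        theta = rng.uniform(0.0, math.pi)
        if rng.random() < 0.5:
            # Top semi-circle, centered at the origin (assumed class -1).
            points.append((r * math.cos(theta), r * math.sin(theta)))
            labels.append(-1)
        else:
            # Bottom semi-circle: center shifted right by rad + thk/2 so that
            # it lines up with the middle of the top semi-circle's edge, and
            # shifted down by sep.
            points.append((rad + thk / 2 + r * math.cos(theta),
                           -sep - r * math.sin(theta)))
            labels.append(+1)
    return points, labels


points, labels = make_semicircles(2000)
print(sum(1 for y in labels if y == +1), "points labeled +1")
```

Because the two regions have equal area, choosing the region with probability 1/2 gives approximately 1,000 examples per class.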


Problem 3.2 For the double-semi-circle task in Problem 3.1, vary sep in 
the range {0.2, 0.4, ..., 5}. Generate 2,000 examples and run the PLA starting 
with w = 0. Record the number of iterations PLA takes to converge. 


Plot sep versus the number of iterations taken for PLA to converge. Explain 
your observations. [Hint: Problem 1.3.] 


Problem 3.3 For the double-semi-circle task in Problem 3.1, set sep = −5 
and generate 2,000 examples. 


(a) What will happen if you run PLA on those examples? 


(b) Run the pocket algorithm for 100,000 iterations and plot $E_{in}$ versus the 
iteration number $t$. 


(c) Plot the data and the final hypothesis in part (b). 


(d) Use the linear regression algorithm to obtain the weights w, and compare 
this result with the pocket algorithm in terms of computation time and 
quality of the solution. 


(e) Repeat (b) to (d) with a 3rd order polynomial feature transform. 


Problem 3.4 In Problem 1.5, we introduced the Adaptive Linear Neu- 
ron (Adaline) algorithm for classification. Here, we derive Adaline from an 
optimization perspective. 


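(a) Consider $E_n(\mathbf{w}) = (\max(0, 1 - y_n\mathbf{w}^T\mathbf{x}_n))^2$. Show that $E_n(\mathbf{w})$ is 
continuous and differentiable. Write down the gradient $\nabla E_n(\mathbf{w})$. 


(b) Show that $E_n(\mathbf{w})$ is an upper bound for $[\![\operatorname{sign}(\mathbf{w}^T\mathbf{x}_n) \ne y_n]\!]$. Hence, 
$\frac{1}{N}\sum_{n=1}^{N} E_n(\mathbf{w})$ is an upper bound for the in-sample classification 
error $E_{in}(\mathbf{w})$. 

(c) Argue that the Adaline algorithm in Problem 1.5 performs stochastic 
gradient descent on $\frac{1}{N}\sum_{n=1}^{N} E_n(\mathbf{w})$. 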


Problem 3.5 


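(a) Consider 
$$E_n(\mathbf{w}) = \max(0, 1 - y_n\mathbf{w}^T\mathbf{x}_n).$$

Show that $E_n(\mathbf{w})$ is continuous and differentiable except when 
$y_n = \mathbf{w}^T\mathbf{x}_n$. 

(b) Show that $E_n(\mathbf{w})$ is an upper bound for $[\![\operatorname{sign}(\mathbf{w}^T\mathbf{x}_n) \ne y_n]\!]$. Hence, 
$\frac{1}{N}\sum_{n=1}^{N} E_n(\mathbf{w})$ is an upper bound for the in-sample classification 
error $E_{in}(\mathbf{w})$. 

(c) Apply stochastic gradient descent on $\frac{1}{N}\sum_{n=1}^{N} E_n(\mathbf{w})$ (ignoring the 
singular case of $\mathbf{w}^T\mathbf{x}_n = y_n$) and derive a new perceptron learning algorithm. 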


Problem 3.6 Derive a linear programming algorithm to fit a linear model 
for classification using the following steps. A linear program is an optimization 
problem of the following form: 


$$\min_{\mathbf{z}} \ \mathbf{c}^T\mathbf{z} \quad \text{subject to} \quad A\mathbf{z} \le \mathbf{b}.$$


A, b and c are parameters of the linear program and z is the optimization vari- 
able. This is such a well studied optimization problem that most mathematics 
software have canned optimization functions which solve linear programs. 


(a) For linearly separable data, show that for some $\mathbf{w}$, $y_n(\mathbf{w}^T\mathbf{x}_n) \ge 1$ for 
$n = 1, \ldots, N$. 

(b) Formulate the task of finding a separating $\mathbf{w}$ for separable data as a linear 
program. You need to specify what the parameters $A$, $\mathbf{b}$, $\mathbf{c}$ are and what 
the optimization variable $\mathbf{z}$ is. 


(c) If the data is not separable, the condition in (a) cannot hold for every $n$. 
Thus introduce the violation $\xi_n \ge 0$ to capture the amount of violation 
for example $\mathbf{x}_n$. So, for $n = 1, \ldots, N$, 

$$y_n(\mathbf{w}^T\mathbf{x}_n) \ge 1 - \xi_n, \qquad \xi_n \ge 0.$$

Naturally, we would like to minimize the amount of violation. One intuitive 
approach is to minimize $\sum_{n=1}^{N} \xi_n$, i.e., we want $\mathbf{w}$ that solves 

$$\min_{\mathbf{w},\,\xi_n} \ \sum_{n=1}^{N} \xi_n \quad \text{subject to} \quad y_n(\mathbf{w}^T\mathbf{x}_n) \ge 1 - \xi_n, \quad \xi_n \ge 0,$$

where the inequalities must hold for $n = 1, \ldots, N$. Formulate this problem 
as a linear program. 


(d) Argue that the linear program you derived in (c) and the optimization 
problem in Problem 3.5 are equivalent. 


Problem 3.7 Use the linear programming algorithm from Problem 3.6 
on the learning task in Problem 3.1 for the separable (sep = 5) and the 
non-separable (sep = −5) cases. 


Compare your results to the linear regression approach with and without the 
3rd order polynomial feature transform. 


Problem 3.8 For linear regression, the out-of-sample error is 
$$E_{out}(h) = \mathbb{E}\left[(h(\mathbf{x}) - y)^2\right].$$
Show that among all hypotheses, the one that minimizes $E_{out}$ is given by 
$$h^*(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}].$$


The function $h^*$ can be treated as a deterministic target function, in which 
case we can write $y = h^*(\mathbf{x}) + \epsilon(\mathbf{x})$ where $\epsilon(\mathbf{x})$ is an (input dependent) noise 
variable. Show that $\epsilon(\mathbf{x})$ has expected value zero. 
Problem 3.9 Assuming that $X^TX$ is invertible, show by direct comparison 
with Equation (3.4) that $E_{in}(\mathbf{w})$ can be written as 

$$E_{in}(\mathbf{w}) = (\mathbf{w} - (X^TX)^{-1}X^T\mathbf{y})^T (X^TX) (\mathbf{w} - (X^TX)^{-1}X^T\mathbf{y}) + \mathbf{y}^T(I - X(X^TX)^{-1}X^T)\mathbf{y}.$$

Use this expression for $E_{in}$ to obtain $\mathbf{w}_{lin}$. What is the in-sample error? [Hint: 
The matrix $X^TX$ is positive definite.] 


Problem 3.10 Exercise 3.3 studied some properties of the hat matrix 
$H = X(X^TX)^{-1}X^T$, where $X$ is an $N$ by $d+1$ matrix, and $X^TX$ is invertible. 
Show the following additional properties. 


(a) Every eigenvalue of H is either 0 or 1. [Hint: Exercise 3.3(b).] 


(b) Show that the trace of a symmetric matrix equals the sum of its eigen- 
values. [Hint: Use the spectral theorem and the cyclic property of the 
trace. Note that the same result holds for non-symmetric matrices, but 
is a little harder to prove.] 


(c) How many eigenvalues of H are 1? What is the rank of H? [Hint: 
Exercise 3.3(d).] 


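Problem 3.11 Consider the linear regression problem setup in Exercise 3.4, 
where the data comes from a genuine linear relationship with added noise. The 
noise for the different data points is assumed to be iid with zero mean and 
variance $\sigma^2$. Assume that the 2nd moment matrix $\Sigma = \mathbb{E}_{\mathbf{x}}[\mathbf{x}\mathbf{x}^T]$ is non-singular. 
Follow the steps below to show that, with high probability, the out-of-sample 
error on average is 

$$E_{out}(\mathbf{w}_{lin}) = \sigma^2\left(1 + \frac{d+1}{N} + o\!\left(\frac{1}{N}\right)\right).$$

(a) For a test point $\mathbf{x}$, show that the error $y - g(\mathbf{x})$ is 
$$\epsilon' - \mathbf{x}^T(X^TX)^{-1}X^T\boldsymbol{\epsilon},$$
where $\epsilon'$ is the noise realization for the test point and $\boldsymbol{\epsilon}$ is the vector of 
noise realizations on the data. 

(b) Take the expectation with respect to the test point, i.e., $\mathbf{x}$ and $\epsilon'$, to 
obtain an expression for $E_{out}$. Show that 
$$E_{out} = \sigma^2 + \operatorname{trace}\left(\Sigma (X^TX)^{-1}X^T\boldsymbol{\epsilon}\boldsymbol{\epsilon}^TX(X^TX)^{-1}\right).$$
[Hints: $a = \operatorname{trace}(a)$ for any scalar $a$; $\operatorname{trace}(AB) = \operatorname{trace}(BA)$; expectation 
and trace commute.] 

(c) What is $\mathbb{E}_{\boldsymbol{\epsilon}}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]$? 

(d) Take the expectation with respect to $\boldsymbol{\epsilon}$ to show that, on average, 
$$E_{out} = \sigma^2 + \frac{\sigma^2}{N}\operatorname{trace}\left(\Sigma \left(\tfrac{1}{N}X^TX\right)^{-1}\right).$$

Note that $\frac{1}{N}X^TX = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T$ is an $N$-sample estimate of $\Sigma$. So 
$\frac{1}{N}X^TX \approx \Sigma$. If $\frac{1}{N}X^TX = \Sigma$, then what is $E_{out}$ on average? 


(e) Show that (after taking the expectation over the data noise) with high 
probability, 
$$E_{out} = \sigma^2\left(1 + \frac{d+1}{N} + o\!\left(\frac{1}{N}\right)\right).$$

[Hint: By the law of large numbers $\frac{1}{N}X^TX$ converges in probability to 
$\Sigma$, and so by continuity of the inverse at $\Sigma$, $\left(\frac{1}{N}X^TX\right)^{-1}$ converges in 
probability to $\Sigma^{-1}$.] 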


Problem 3.12 In linear regression, the in-sample predictions are given by 
$\hat{\mathbf{y}} = H\mathbf{y}$, where $H = X(X^TX)^{-1}X^T$. Show that $H$ is a projection matrix, i.e. 
$H^2 = H$. So $\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto some space. What is this space? 


Problem 3.13 This problem creates a linear regression algorithm from 
a good algorithm for linear classification. As illustrated, the idea is to take the 
original data and shift it in one direction to get the +1 data points; then, shift 
it in the opposite direction to get the —1 data points. 





[Figure, left: original data for the one-dimensional regression problem. 
Right: shifted data viewed as a two-dimensional classification problem.] 


More generally, the data $(\mathbf{x}_n, y_n)$ can be viewed as data points in $\mathbb{R}^{d+1}$ by 
treating the $y$-value as the $(d+1)$th coordinate. 

Now, construct positive and negative points 

$$\mathcal{D}_+ = (\mathbf{x}_1, y_1) + \mathbf{a}, \ \ldots, \ (\mathbf{x}_N, y_N) + \mathbf{a}$$

$$\mathcal{D}_- = (\mathbf{x}_1, y_1) - \mathbf{a}, \ \ldots, \ (\mathbf{x}_N, y_N) - \mathbf{a},$$

where $\mathbf{a}$ is a perturbation parameter. You can now use the linear programming 
algorithm in Problem 3.6 to separate $\mathcal{D}_+$ from $\mathcal{D}_-$. The resulting separating 
hyperplane can be used as the regression ‘fit’ to the original data. 


(a) How many weights are learned in the classification problem? How many 
weights are needed for the linear fit in the regression problem? 


(b) The linear fit requires weights $\mathbf{w}$, where $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$. Suppose the 
weights returned by solving the classification problem are $\mathbf{w}_{class}$. Derive 
an expression for $\mathbf{w}$ as a function of $\mathbf{w}_{class}$. 


(c) Generate a data set $y_n = x_n^2 + \sigma\epsilon_n$ with $N = 50$, where $x_n$ is uniform on 
$[0, 1]$ and $\epsilon_n$ is zero mean Gaussian noise; set $\sigma = 0.1$. Plot $\mathcal{D}_+$ and $\mathcal{D}_-$ 
for $\mathbf{a} = [0, 0.1]^T$. 

(d) Give comparisons of the resulting fits from running the classification 
approach and the analytic pseudo-inverse algorithm for linear regression. 


Problem 3.14 In a regression setting, assume the target function is linear, 
so $f(\mathbf{x}) = \mathbf{x}^T\mathbf{w}_f$, and $\mathbf{y} = X\mathbf{w}_f + \boldsymbol{\epsilon}$, where the entries in $\boldsymbol{\epsilon}$ are zero mean, iid 
with variance $\sigma^2$. In this problem derive the bias and variance as follows. 


(a) Show that the average function is $\bar{g}(\mathbf{x}) = f(\mathbf{x})$, no matter what the size 
of the data set, as long as $X^TX$ is invertible. What is the bias? 


(b) What is the variance? [Hint: Problem 3.11] 


Problem 3.15 In the text we derived that the linear regression solution 
weights must satisfy $X^TX\mathbf{w} = X^T\mathbf{y}$. If $X^TX$ is not invertible, the solution 
$\mathbf{w}_{lin} = (X^TX)^{-1}X^T\mathbf{y}$ won't work. In this event, there will be many solutions 
for $\mathbf{w}$ that minimize $E_{in}$. Here, you will derive one such solution. Let $\rho$ be 
the rank of $X$. Assume that the singular value decomposition (SVD) of $X$ is 
$X = U\Gamma V^T$, where $U \in \mathbb{R}^{N \times \rho}$ satisfies $U^TU = I_\rho$, $V \in \mathbb{R}^{(d+1) \times \rho}$ satisfies 
$V^TV = I_\rho$, and $\Gamma \in \mathbb{R}^{\rho \times \rho}$ is a positive diagonal matrix. 


(a) Show that $\rho < d+1$. 


(b) Show that $\mathbf{w}_{lin} = V\Gamma^{-1}U^T\mathbf{y}$ satisfies $X^TX\mathbf{w}_{lin} = X^T\mathbf{y}$, and hence is a 
solution. 


(c) Show that for any other solution that satisfies $X^TX\mathbf{w} = X^T\mathbf{y}$, $\|\mathbf{w}_{lin}\| < \|\mathbf{w}\|$. 
That is, the solution we have constructed is the minimum norm set 
of weights that minimizes $E_{in}$. 
Problem 3.16 In Example 3.4, it is mentioned that the output of the 
final hypothesis $g(\mathbf{x})$ learned using logistic regression can be thresholded to get 
a ‘hard’ (±1) classification. This problem shows how to use the risk matrix 
introduced in Example 1.1 to obtain such a threshold. 


Consider fingerprint verification, as in Example 1.1. After learning from the 
data using logistic regression, you produce the final hypothesis 

$$g(\mathbf{x}) = \mathbb{P}[y = +1 \mid \mathbf{x}],$$

which is your estimate of the probability that $y = +1$. Suppose that the cost 
matrix is given by 

                         True classification 
                     +1 (correct person)    −1 (intruder) 
  you say    +1             0                   $c_a$ 
             −1             $c_r$                 0 


For a new person with fingerprint $\mathbf{x}$, you compute $g(\mathbf{x})$ and you now need to 
decide whether to accept or reject the person (i.e., you need a hard classification). 
So, you will accept if $g(\mathbf{x}) \ge \kappa$, where $\kappa$ is the threshold. 


(a) Define the cost(accept) as your expected cost if you accept the person. 
Similarly define cost(reject). Show that 

$$\text{cost(accept)} = (1 - g(\mathbf{x}))\,c_a,$$
$$\text{cost(reject)} = g(\mathbf{x})\,c_r.$$

(b) Use part (a) to derive a condition on $g(\mathbf{x})$ for accepting the person and 
hence show that 

$$\kappa = \frac{c_a}{c_a + c_r}.$$

(c) Use the cost-matrices for the Supermarket and CIA applications in 
Example 1.1 to compute the threshold $\kappa$ for each of these two cases. Give 
some intuition for the thresholds you get. 


Problem 3.17 Consider a function 
$$E(u, v) = e^u + e^{2v} + e^{uv} + u^2 - 3uv + 4v^2 - 3u - 5v.$$

(a) Approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_1(\Delta u, \Delta v)$, where $\hat{E}_1$ is the 
first-order Taylor's expansion of $E$ around $(u, v) = (0, 0)$. Suppose 
$\hat{E}_1(\Delta u, \Delta v) = a_u\Delta u + a_v\Delta v + a$. What are the values of $a_u$, $a_v$, 
and $a$? 

(b) Minimize $\hat{E}_1$ over all possible $(\Delta u, \Delta v)$ such that $\|(\Delta u, \Delta v)\| = 0.5$. 
In this chapter, we proved that the optimal column vector $[\Delta u, \Delta v]^T$ is 
parallel to the column vector $-\nabla E(u, v)$, which is called the negative 
gradient direction. Compute the optimal $(\Delta u, \Delta v)$ and the resulting 
$E(u + \Delta u, v + \Delta v)$. 

(c) Approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_2(\Delta u, \Delta v)$, where $\hat{E}_2$ is the 
second-order Taylor's expansion of $E$ around $(u, v) = (0, 0)$. Suppose 

$$\hat{E}_2(\Delta u, \Delta v) = b_{uu}(\Delta u)^2 + b_{vv}(\Delta v)^2 + b_{uv}(\Delta u)(\Delta v) + b_u\Delta u + b_v\Delta v + b.$$

What are the values of $b_{uu}$, $b_{vv}$, $b_{uv}$, $b_u$, $b_v$, and $b$? 

(d) Minimize $\hat{E}_2$ over all possible $(\Delta u, \Delta v)$ (regardless of length). Use the 
fact that $\nabla^2 E(u, v)\big|_{(0,0)}$ (the Hessian matrix at $(0, 0)$) is positive definite 
to prove that the optimal column vector 

$$\begin{bmatrix} \Delta u^* \\ \Delta v^* \end{bmatrix} = -\left(\nabla^2 E(u, v)\right)^{-1} \nabla E(u, v),$$

which is called the Newton direction. 

(e) Numerically compute the following values: 
(i) the vector $(\Delta u, \Delta v)$ of length 0.5 along the Newton direction, and 
the resulting $E(u + \Delta u, v + \Delta v)$. 
(ii) the vector $(\Delta u, \Delta v)$ of length 0.5 that minimizes $E(u + \Delta u, v + \Delta v)$, 
and the resulting $E(u + \Delta u, v + \Delta v)$. (Hint: Let $\Delta u = 0.5\sin\theta$.) 

Compare the values of $E(u + \Delta u, v + \Delta v)$ in (b), (e-i), and (e-ii). Briefly 
state your findings. 


The negative gradient direction and the Newton direction are quite fundamental 
for designing optimization algorithms. It is important to understand these 
directions and put them in your toolbox for designing learning algorithms. 


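Problem 3.18 Take the feature transform $\Phi_2$ in Equation (3.13) as $\Phi$. 


(a) Show that $d_{vc}(\mathcal{H}_\Phi) \le 6$. 

(b) Show that $d_{vc}(\mathcal{H}_\Phi) > 4$. [Hint: Exercise 3.12] 

(c) Give an upper bound on $d_{vc}(\mathcal{H}_{\Phi_k})$ for $\mathcal{X} = \mathbb{R}^d$. 

(d) Define 

$$\tilde{\Phi}_2 : \mathbf{x} \to (1, x_1, x_2, x_1 + x_2, x_1 - x_2, x_1^2, x_1x_2, x_2x_1, x_2^2) \quad \text{for } \mathbf{x} \in \mathbb{R}^2.$$

Argue that $d_{vc}(\mathcal{H}_{\tilde{\Phi}_2}) = d_{vc}(\mathcal{H}_{\Phi_2})$. In other words, while $\tilde{\Phi}_2(\mathcal{X}) \in \mathbb{R}^9$, 
$d_{vc}(\mathcal{H}_{\tilde{\Phi}_2}) \le 6 < 9$. Thus, the dimension of $\Phi(\mathcal{X})$ only gives an upper 
bound of $d_{vc}(\mathcal{H}_\Phi)$, and the exact value of $d_{vc}(\mathcal{H}_\Phi)$ can depend on the 
components of the transform. 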
Problem 3.19 A Transformer thinks the following procedures would 
work well in learning from two-dimensional data sets of any size. Please point 
out if there are any potential problems in the procedures: 


(a) Use the feature transform 

$$\Phi(\mathbf{x}) = \begin{cases} (\underbrace{0, \cdots, 0}_{n-1}, 1, 0, \cdots) & \text{if } \mathbf{x} = \mathbf{x}_n \\ (0, 0, \cdots, 0) & \text{otherwise} \end{cases}$$

before running PLA. 

(b) Use the feature transform $\Phi$ with 

$$\phi_n(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2\gamma^2}\right)$$

using some very small $\gamma$. 


(c) Use the feature transform $\Phi$ that consists of all 

$$\phi_{i,j}(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - (i, j)\|^2}{2\gamma^2}\right)$$

before running PLA, with $i \in \{0, \frac{1}{100}, \ldots, 1\}$ and $j \in \{0, \frac{1}{100}, \ldots, 1\}$. 
Chapter 4 


Overfitting 


Paraskavedekatriaphobia¹ (fear of Friday the 13th), and superstitions in general, 
are perhaps the most illustrious cases of the human ability to overfit. 
Unfortunate events are memorable, and given a few such memorable events, 
it is natural to try and find an explanation. In the future, will there be more 
unfortunate events on Friday the 13th’s than on any other day? 

Overfitting is the phenomenon where fitting the observed facts (data) well 
no longer indicates that we will get a decent out-of-sample error, and may 
actually lead to the opposite effect. You have probably seen cases of overfit- 
ting when the learning model is more complex than is necessary to represent 
the target function. The model uses its additional degrees of freedom to fit 
idiosyncrasies in the data (for example, noise), yielding a final hypothesis that 
is inferior. Overfitting can occur even when the hypothesis set contains only 
functions which are far simpler than the target function, and so the plot 
thickens ☺. 

The ability to deal with overfitting is what separates professionals from 
amateurs in the field of learning from data. We will cover three themes: 
When does overfitting occur? What are the tools to combat overfitting? How 
can one estimate the degree of overfitting and ‘certify’ that a model is good, 
or better than another? Our emphasis will be on techniques that work well in 
practice. 


4.1 When Does Overfitting Occur? 


Overfitting literally means “Fitting the data more than is warranted.” The 
main case of overfitting is when you pick the hypothesis with lower $E_{in}$, and 
it results in higher $E_{out}$. This means that $E_{in}$ alone is no longer a good guide 
for learning. Let us start by identifying the cause of overfitting. 


¹ from the Greek paraskevi (Friday), dekatreis (thirteen), phobia (fear) 
Consider a simple one-dimensional regression problem with five data points. 
We do not know the target function, so let’s select a general model, maximizing 
our chance to capture the target function. Since 5 data points can be fit 
by a 4th order polynomial, we select 4th order polynomials. 

[Figure: the five data points, the target, and the fit, plotted as y versus x.] 

The result is shown in the figure. The target function is a 2nd order polynomial 
(blue curve), with a little added noise in the data points. Though the target is 
simple, the learning algorithm used the full power of the 4th order polynomial to 
fit the data exactly, but the result does not look anything like the target function. 
The data has been ‘overfit.’ The little noise in the data has misled the learning, 
for if there were no noise, the fitted red curve would exactly match the target. 
This is a typical overfitting scenario, in which a complex model uses its 
additional degrees of freedom to ‘learn’ the noise. 

The fit has zero in-sample error but huge out-of-sample error, so this is a 
case of bad generalization (as discussed in Chapter 2), a likely outcome when 
overfitting is occurring. However, our definition of overfitting goes beyond 
bad generalization for any given hypothesis. Instead, overfitting applies to 
a process: in this case, the process of picking a hypothesis with lower and 
lower $E_{in}$ resulting in higher and higher $E_{out}$. 
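
The five-point example can be reproduced in a few lines. The sketch below is not the authors' experiment: the quadratic target, the noise level of 0.1, and the equally spaced inputs are all assumptions chosen for illustration. It fits the unique 4th order polynomial through five points (via Lagrange interpolation) once with noisy outputs and once with noiseless outputs.

```python
import random


def target(x):
    # Hypothetical 2nd order target; the text does not give its coefficients.
    return 1.0 - x + 2.0 * x * x


def interpolate(xs, ys):
    """Return the polynomial of degree <= len(xs) - 1 through all the points."""
    def g(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)  # Lagrange basis factor
            total += term
        return total
    return g


def mean_sq_error(g, f, points):
    return sum((g(x) - f(x)) ** 2 for x in points) / len(points)


rng = random.Random(0)
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
noisy_ys = [target(x) + rng.gauss(0.0, 0.1) for x in xs]
clean_ys = [target(x) for x in xs]
grid = [-1.0 + 2.0 * k / 200 for k in range(201)]  # test points in [-1, 1]

g_noisy = interpolate(xs, noisy_ys)
g_clean = interpolate(xs, clean_ys)

e_in = sum((g_noisy(x) - y) ** 2 for x, y in zip(xs, noisy_ys)) / len(xs)
print("E_in of the noisy 4th order fit:", e_in)
print("error against target (noisy data):", mean_sq_error(g_noisy, target, grid))
print("error against target (noiseless data):", mean_sq_error(g_clean, target, grid))
```

With noiseless data the 4th order fit coincides with the quadratic target up to rounding, while with noisy data the in-sample error is still zero but the fit departs from the target. How far it departs depends on the noise realization and on where the five inputs fall.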


4.1.1 A Case Study: Overfitting with Polynomials 


Let’s dig deeper to gain a better understanding of when overfitting occurs. We 
will illustrate the main concepts using data in one-dimension and polynomial 
regression, a special case of a linear model that uses the feature transform 
$x \mapsto (1, x, x^2, \cdots)$. Consider the two regression problems below: 

[Figure: data and target plotted as y versus x. (a) 10th order target function; 
(b) 50th order target function.] 

In both problems, the target function is a polynomial and the data set $\mathcal{D}$ 
contains 15 data points. In (a), the target function is a 10th order polynomial 
and the sampled data are noisy (the data do not lie on the target function 
curve). In (b), the target function is a 50th order polynomial and the data are 
noiseless. 

[Figure: 2nd order and 10th order fits. (a) Noisy low-order target; 
(b) Noiseless high-order target.] 

Figure 4.1: Fits using 2nd and 10th order polynomials to 15 data points. 
In (a), the data are noisy and the target is a 10th order polynomial. In (b) 
the data are noiseless and the target is a 50th order polynomial. 

The best 2nd and 10th order fits are shown in Figure 4.1, and the in-sample 
and out-of-sample errors are given in the following table. 


               10th order noisy target      50th order noiseless target 
               2nd Order    10th Order      2nd Order    10th Order 
  $E_{in}$        0.050        0.034           0.029        $10^{-5}$ 
  $E_{out}$       0.127        9.00            0.120        7680 


What the learning algorithm sees is the data, not the target function. In both 
cases, the 10th order polynomial heavily overfits the data, and results in a 
nonsensical final hypothesis which does not resemble the target function. The 
2nd order fits do not capture the full nature of the target function either, but 
they do at least capture its general trend, resulting in significantly lower out-of- 
sample error. The 10th order fits have lower in-sample error and higher out-of- 
sample error, so this is indeed a case of overfitting that results in pathologically 
bad generalization. 


Exercise 4.1 


Let $\mathcal{H}_2$ and $\mathcal{H}_{10}$ be the 2nd and 10th order hypothesis sets respectively. 
Specify these sets as parameterized sets of functions. Show that $\mathcal{H}_2 \subset \mathcal{H}_{10}$. 


These two examples reveal some surprising phenomena. Let’s consider first the 
10th order target function, Figure 4.1(a). Here is the scenario. Two learners, O 
(for overfitted) and R (for restricted), know that the target function is a 10th 
order polynomial, and that they will receive 15 noisy data points. Learner O 
uses model $\mathcal{H}_{10}$, which is known to contain the target function, and finds the 
best fitting hypothesis to the data. Learner R uses model $\mathcal{H}_2$, and similarly 
finds the best fitting hypothesis to the data. 

[Figure panels: learning curves for $\mathcal{H}_2$; learning curves for $\mathcal{H}_{10}$.] 

Figure 4.2: Overfitting is occurring for $N$ in the shaded gray region 
because by choosing $\mathcal{H}_{10}$ which has better $E_{in}$, you get worse $E_{out}$. 


The surprising thing is that learner R wins (lower out-of-sample error) by 
using the smaller model, even though she has knowingly given up the ability 
to implement the true target function. Learner R trades off a worse in-sample 
error for a huge gain in the generalization error, ultimately resulting in lower 
out-of-sample error. What is funny here? A folklore belief about learning is 
that best results are obtained by incorporating as much information about the 
target function as is available. But as we see here, even if we know the order 
of the target and naively incorporate this knowledge by choosing the model 
accordingly (H10), the performance is inferior to that demonstrated by the 
more ‘stable’ 2nd order model. 

The models $\mathcal{H}_2$ and $\mathcal{H}_{10}$ were in fact the ones used to generate the learning 
curves in Chapter 2, and we use those same learning curves to illustrate 
overfitting in Figure 4.2. If you mentally superimpose the two plots, you can 
see that there is a range of $N$ for which $\mathcal{H}_{10}$ has lower $E_{in}$ but higher $E_{out}$ 
than $\mathcal{H}_2$ does, a case in point of overfitting. 

Is learner R always going to prevail? Certainly not. For example, if the 
data was noiseless, then indeed learner O would recover the target function 
exactly from 15 data points, while learner R would have no hope. This brings 
us to the second example, Figure 4.1(b). Here, the data is noiseless, but the 
target function is very complex (50th order polynomial). Again learner R 
wins, and again because learner O heavily overfits the data. Overfitting is 
not a disease inflicted only upon complex models with many more degrees of 
freedom than warranted by the complexity of the target function. In fact the 
reverse is true here, and overfitting is just as bad. What matters is how the 
model complexity matches the quantity and quality of the data we have, not 
how it matches the target function. 
4.1.2 Catalysts for Overfitting 


A skeptical reader should ask whether the examples in Figure 4.1 are just 
pathological constructions created by the authors, or is overfitting a real phe- 
nomenon which has to be considered carefully when learning from data? The 
next exercise guides you through an experimental design for studying overfit- 
ting within our current setup. We will use the results from this experiment 
to serve two purposes: to convince you that overfitting is not the result of 
some rare pathological construction, and to unravel some of the conditions 
conducive to overfitting. 


Exercise 4.2 [Experimental design for studying overfitting] 


This is a reading exercise that sets up an experimental framework to study 
various aspects of overfitting. The reader interested in implementing the 
experiment can find the details fleshed out in Problem 4.4. The input 
space is $\mathcal{X} = [-1, 1]$, with uniform input probability density, $P(x) = \frac{1}{2}$. 
We consider the two models $\mathcal{H}_2$ and $\mathcal{H}_{10}$. 

The target is a degree-$Q_f$ polynomial, which we write $f(x) = 
\sum_{q=0}^{Q_f} a_q L_q(x)$, where $L_q(x)$ are polynomials of increasing complexity (the 
Legendre polynomials). The data set is $\mathcal{D} = (x_1, y_1), \ldots, (x_N, y_N)$, where 
$y_n = f(x_n) + \sigma\epsilon_n$ and $\epsilon_n$ are iid (independent and identically distributed) 
standard Normal random variates. 


For a single experiment, with specified values for $Q_f$, $N$, $\sigma$, generate a random 
degree-$Q_f$ target function by selecting coefficients $a_q$ independently 
from a standard Normal, rescaling them so that $\mathbb{E}_{a,x}[f^2] = 1$. Generate 
a data set, selecting $x_1, \ldots, x_N$ independently according to $P(x)$ 
and $y_n = f(x_n) + \sigma\epsilon_n$. Let $g_2$ and $g_{10}$ be the best fit hypotheses to 
the data from $\mathcal{H}_2$ and $\mathcal{H}_{10}$ respectively, with out-of-sample errors $E_{out}(g_2)$ 
and $E_{out}(g_{10})$. 

Vary $Q_f$, $N$, $\sigma$, and for each combination of parameters, run a large number 
of experiments, each time computing $E_{out}(g_2)$ and $E_{out}(g_{10})$. Averaging 
these out-of-sample errors gives estimates of the expected out-of-sample 
error for the given learning scenario ($Q_f$, $N$, $\sigma$) using $\mathcal{H}_2$ and $\mathcal{H}_{10}$. 


Exercise 4.2 set up an experiment to study how the noise level $\sigma^2$, the target 
complexity $Q_f$, and the number of data points $N$ relate to overfitting. We 
compare the final hypothesis $g_{10} \in \mathcal{H}_{10}$ (larger model) to the final hypothesis 
$g_2 \in \mathcal{H}_2$ (smaller model). Clearly, $E_{in}(g_{10}) \le E_{in}(g_2)$ since $g_{10}$ has more 
degrees of freedom to fit the data. What is surprising is how often $g_{10}$ overfits 
the data, resulting in $E_{out}(g_{10}) > E_{out}(g_2)$. Let us define the overfit measure 
as $E_{out}(g_{10}) - E_{out}(g_2)$. The more positive this measure is, the more severe 
overfitting would be. 
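
A minimal sketch of one such experiment is given below, using NumPy's Legendre utilities. It is not the authors' implementation, and it makes one simplification that should be kept in mind: each random target is rescaled so that $\mathbb{E}_x[f^2] = 1$ for that particular target, rather than normalizing in expectation over the coefficients. The names `legendre_msq` and `one_experiment` are ours. The out-of-sample error is computed exactly from the orthogonality of the Legendre polynomials, $\mathbb{E}_x[L_q^2] = 1/(2q+1)$ for $x$ uniform on $[-1, 1]$.

```python
import numpy as np
from numpy.polynomial import legendre as leg


def legendre_msq(coef):
    """E_x[(sum_q c_q L_q(x))^2] for x uniform on [-1, 1]."""
    q = np.arange(len(coef))
    return float(np.sum(np.asarray(coef) ** 2 / (2 * q + 1)))


def one_experiment(Qf, N, sigma, rng):
    """Return {degree: (E_in, E_out)} for the best fits from H_2 and H_10."""
    a = rng.standard_normal(Qf + 1)
    a /= np.sqrt(legendre_msq(a))  # simplification: E_x[f^2] = 1 per target
    x = rng.uniform(-1.0, 1.0, N)
    y = leg.legval(x, a) + sigma * rng.standard_normal(N)
    results = {}
    for deg in (2, 10):
        V = leg.legvander(x, deg)                  # N x (deg+1) design matrix
        w, *_ = np.linalg.lstsq(V, y, rcond=None)  # least-squares fit
        e_in = float(np.mean((V @ w - y) ** 2))
        # Coefficients of g - f in the Legendre basis, zero-padded.
        diff = np.zeros(max(deg, Qf) + 1)
        diff[: deg + 1] += w
        diff[: Qf + 1] -= a
        e_out = legendre_msq(diff) + sigma ** 2    # noise adds sigma^2
        results[deg] = (e_in, e_out)
    return results


rng = np.random.default_rng(0)
runs = [one_experiment(Qf=20, N=15, sigma=np.sqrt(0.1), rng=rng)
        for _ in range(200)]
measures = [r[10][1] - r[2][1] for r in runs]
print("median overfit measure:", float(np.median(measures)))
print("fraction of runs with overfitting:", float(np.mean(np.array(measures) > 0)))
```

The median is reported alongside the fraction of runs because individual runs of the 10th order fit can produce very large out-of-sample errors, which makes a plain average noisy for a small number of runs.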

Figure 4.3 shows how the extent of overfitting depends on certain parameters 
of the learning problem (the results are from our implementation of 
Exercise 4.2). In the figure, the colors map to the level of overfitting, with redder 
regions showing worse overfitting. These red regions are large; overfitting is 
real, and here to stay. 

Figure 4.3: How overfitting depends on the noise $\sigma^2$, the target function 
complexity $Q_f$, and the number of data points $N$. The colors map to the 
overfit measure $E_{out}(\mathcal{H}_{10}) - E_{out}(\mathcal{H}_2)$. In (a) we see how overfitting depends 
on $\sigma^2$ and $N$, with $Q_f = 20$. As $\sigma^2$ increases we are adding stochastic noise 
to the data. In (b) we see how overfitting depends on $Q_f$ and $N$, with 
$\sigma^2 = 0.1$. As $Q_f$ increases we are adding deterministic noise to the data. 

Figure 4.3(a) reveals that there is less overfitting when the noise level $\sigma^2$ 
drops or when the number of data points $N$ increases (the linear pattern in 
Figure 4.3(a) is typical). Since the ‘signal’ $f$ is normalized to $\mathbb{E}[f^2] = 1$, 
the noise level $\sigma^2$ is automatically calibrated to the signal level. Noise leads 
the learning astray, and the larger, more complex model is more susceptible to 
noise than the simpler one because it has more ways to go astray. Figure 4.3(b) 
reveals that target function complexity $Q_f$ affects overfitting in a similar way 
to noise, albeit nonlinearly. To summarize, 


  Number of data points ↑     Overfitting ↓ 
  Noise ↑                     Overfitting ↑ 
  Target complexity ↑         Overfitting ↑ 


Deterministic noise. Why does a higher target complexity lead to more 
overfitting when comparing the same two models? The intuition is that for a 
given learning model, there is a best approximation to the target function. The 
part of the target function ‘outside’ this best fit acts like noise in the data. We 
can call this deterministic noise to differentiate it from the random stochastic 
noise. Just as stochastic noise cannot be modeled, the deterministic noise 
is that part of the target function which cannot be modeled. The learning 
algorithm should not attempt to fit the noise; however, it cannot distinguish 
noise from signal. On a finite data set, the algorithm inadvertently uses some 
of the degrees of freedom to fit the noise, which can result in overfitting and 
a spurious final hypothesis. 

Figure 4.4: Deterministic noise. $h^*$ is the best fit to $f$ in $\mathcal{H}_2$. The shading 
illustrates deterministic noise for this learning problem. 

Figure 4.4 illustrates deterministic noise for a quadratic model fitting a 
more complex target function. While stochastic and deterministic noise have 
similar effects on overfitting, there are two basic differences between the two 
types of noise. First, if we generated the same data (x values) again, the 
deterministic noise would not change but the stochastic noise would. Second, 
different models capture different ‘parts’ of the target function, hence the same 
data set will have different deterministic noise depending on which model we 
use. In reality, we work with one model at a time and have only one data set 
on hand. Hence, we have one realization of the noise to work with and the 
algorithm cannot differentiate between the two types of noise. 
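
To make the notion concrete, here is a minimal numerical sketch of deterministic noise. It rests on assumptions that are not in the text: the target `target` is a hypothetical function chosen only because no quadratic can match it, and the best fit $h^*$ is approximated by a least-squares fit on a dense grid that stands in for the input distribution.

```python
import numpy as np

def target(x):
    # Hypothetical target, more complex than any quadratic (an assumption for illustration).
    return np.sin(np.pi * x) + 0.5 * x**2

x = np.linspace(-1.0, 1.0, 2001)   # dense grid standing in for the input distribution
f = target(x)

def deterministic_noise(deg):
    # Least-squares polynomial of the given degree approximates the best fit h* in the model.
    coeffs = np.polyfit(x, f, deg)
    return f - np.polyval(coeffs, x)   # the part of f the model cannot capture

noise_H2 = deterministic_noise(2)
noise_H5 = deterministic_noise(5)
level_H2 = np.mean(noise_H2**2)
level_H5 = np.mean(noise_H5**2)

# Stochastic noise changes when the same x values are generated again;
# the deterministic noise computed above does not.
rng = np.random.default_rng(0)
stochastic_first = 0.1 * rng.standard_normal(x.shape)
stochastic_second = 0.1 * rng.standard_normal(x.shape)
```

The two levels `level_H2` and `level_H5` differ on the same target and the same x values, which is the second difference noted above: the deterministic noise depends on which model we use.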


Exercise 4.3 
Deterministic noise depends on H, as some models approximate f better 
than others. 


(a) Assume H is fixed and we increase the complexity of f. Will deter- 
ministic noise in general go up or down? Is there a higher or lower 
tendency to overfit? 


(b) Assume f is fixed and we decrease the complexity of H. Will deter- 
ministic noise in general go up or down? Is there a higher or lower 
tendency to overfit? [Hint: There is a race between two factors that 
affect overfitting in opposite ways, but one wins.] 


The bias-variance decomposition, which we discussed in Section 2.3.1 (see also 
Problem 2.22) is a useful tool for understanding how noise affects performance: 


$$\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = \sigma^2 + \text{bias} + \text{var}.$$


The first two terms reflect the direct impact of the stochastic and deterministic 
noise. The variance of the stochastic noise is $\sigma^2$ and the bias is directly 
related to the deterministic noise in that it captures the model’s inability to 
approximate f. The var term is indirectly impacted by both types of noise, 
capturing a model’s susceptibility to being led astray by the noise. 


4.2 Regularization 


Regularization is our first weapon to combat overfitting. It constrains the 
learning algorithm to improve out-of-sample error, especially when noise is 
present. To whet your appetite, look at what a little regularization can do for 
our first overfitting example in Section 4.1. Though we only used a very small 
‘amount’ of regularization, the fit improves dramatically. 





[Figure: two fits to the first overfitting example of Section 4.1, one without regularization and one with regularization.] 


Now that we have your attention, we would like to come clean. Regularization 
is as much an art as it is a science. Most of the methods used successfully 
in practice are heuristic methods. However, these methods are grounded in a 
mathematical framework that is developed for special cases. We will discuss 
both the mathematical and the heuristic, trying to maintain a balance that 
reflects the reality of the field. 

Speaking of heuristics, one view of regularization is through the lens of the 
VC bound, which bounds $E_{\text{out}}$ using a model complexity penalty $\Omega(\mathcal{H})$: 

$$E_{\text{out}}(h) \le E_{\text{in}}(h) + \Omega(\mathcal{H}) \quad \text{for all } h \in \mathcal{H}.$$   (4.1) 

So, we are better off if we fit the data using a simple $\mathcal{H}$. Extrapolating one step 
further, we should be better off by fitting the data using a ‘simple’ h from $\mathcal{H}$. 
The essence of regularization is to concoct a measure $\Omega(h)$ for the complexity 
of an individual hypothesis. Instead of minimizing $E_{\text{in}}(h)$ alone, one minimizes 
a combination of $E_{\text{in}}(h)$ and $\Omega(h)$. This avoids overfitting by constraining the 
learning algorithm to fit the data well using a simple hypothesis. 


Example 4.1. One popular regularization technique is weight decay, which 
measures the complexity of a hypothesis h by the size of the coefficients used 
to represent h (e.g. in a linear model). This heuristic prefers mild lines with 
small offset and slope, to wild lines with bigger offset and slope. We will get to 
the mechanics of weight decay shortly, but for now let’s focus on the outcome. 

We apply weight decay to fitting the target $f(x) = \sin(\pi x)$ using N = 2 
data points (as in Example 2.8). We sample x uniformly in $[-1, 1]$, generate a 
data set and fit a line to the data (our model is $\mathcal{H}_1$). The figures below show the 
resulting fits on the same (random) data sets with and without regularization. 




















[Figure: linear fits to the random two-point data sets, plotted against x, in two panels: without regularization and with regularization.] 


Without regularization, the learned function varies extensively depending on 
the data set. As we have seen in Example 2.8, a constant model scored 
$E_{\text{out}} = 0.75$, handily beating the performance of the (unregularized) linear 
model that scored $E_{\text{out}} = 1.90$. With a little weight decay regularization, 
the fits to the same data sets are considerably less volatile. This results in a 
significantly lower $E_{\text{out}} = 0.56$ that beats both the constant model and the 
unregularized linear model. 

The bias-variance decomposition helps us to understand how the regularized 
version beat both the unregularized version and the constant model. 


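[Figure: the target $\sin(\pi x)$ and the average hypothesis, in two panels.] 

without regularization: bias = 0.21; var = 1.69. 
with regularization: bias = 0.23; var = 0.33. 

Average hypothesis $\bar{g}$ (red) with $\text{var}(x)$ indicated by the gray shaded 
region that is $\bar{g}(x) \pm \sqrt{\text{var}(x)}$. 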


As expected, regularization reduced the var term rather dramatically from 1.69 
down to 0.33. The price paid in terms of the bias (quality of the average fit) was 
modest, only slightly increasing from 0.21 to 0.23. The result was a significant 
decrease in the expected out-of-sample error because bias + var decreased. This 
is the crux of regularization. By constraining the learning algorithm to select 
‘simpler’ hypotheses from $\mathcal{H}$, we sacrifice a little bias for a significant gain in 
the var. □ 
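
The experiment in this example can be approximated with a short simulation. The sketch below is not the code behind the reported numbers; it assumes the weight decay solution minimizes the sum of squared errors plus `lam` times the squared weights, uses `lam = 0.1`, and picks the number of data sets and the test grid arbitrarily. The helper names `fit_line` and `bias_and_var` are ours.

```python
import numpy as np

def fit_line(x, y, lam):
    # Minimize ||Zw - y||^2 + lam * ||w||^2 with Z = [1, x], written as one
    # least-squares problem so that nearly coincident points stay well behaved.
    Z = np.column_stack([np.ones_like(x), x])
    Z_aug = np.vstack([Z, np.sqrt(lam) * np.eye(2)])
    y_aug = np.concatenate([y, np.zeros(2)])
    w, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)
    return w

def bias_and_var(lam, n_datasets=5000, seed=0):
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(-1.0, 1.0, 201)
    f_grid = np.sin(np.pi * x_grid)
    W = np.empty((n_datasets, 2))
    for i in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, size=2)        # N = 2 data points
        W[i] = fit_line(x, np.sin(np.pi * x), lam)
    G = W[:, :1] + W[:, 1:] * x_grid              # row i: the i-th learned line on the grid
    g_bar = G.mean(axis=0)                        # average hypothesis
    bias = np.mean((g_bar - f_grid) ** 2)
    var = np.mean(G.var(axis=0))
    return bias, var

bias_0, var_0 = bias_and_var(lam=0.0)
bias_r, var_r = bias_and_var(lam=0.1)
print(f"no regularization: bias={bias_0:.2f} var={var_0:.2f}")
print(f"weight decay:      bias={bias_r:.2f} var={var_r:.2f}")
```

Because both runs use the same seed, the regularized and unregularized fits see the same random data sets, as in the figures of this example.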


This example also illustrates why regularization is needed. The linear 
model is too sophisticated for the amount of data we have, since a line can 
perfectly fit any 2 points. This need would persist even if we changed the 
target function, as long as we have either stochastic or deterministic noise. 
The need for regularization depends on the quantity and quality of the data. 
Given our meager data set, our choices were either to take a simpler model, 
such as the model with constant functions, or to constrain the linear model. It 
turns out that using the complex model but constraining the algorithm toward 
simpler hypotheses gives us more flexibility, and ends up giving the best Eout. 
In practice, this is the rule not the exception. 

Enough heuristics. Let’s develop the mathematics of regularization. 


4.2.1 A Soft Order Constraint 


In this section, we derive a regularization method that applies to a wide va- 
riety of learning problems. To simplify the math, we will use the concrete 
setting of regression using Legendre polynomials, the polynomials of increas- 
ing complexity used in Exercise 4.2. So, let’s first formally introduce you to 
the Legendre polynomials. 

Consider a learning model where $\mathcal{H}$ is the set of polynomials in one variable 
$x \in [-1, 1]$. Instead of expressing the polynomials in terms of consecutive 
powers of x, we will express them as a combination of Legendre polynomials 
in x. Legendre polynomials are a standard set of polynomials with nice analytic 
properties that result in simpler derivations. The zeroth-order Legendre 
polynomial is the constant $L_0(x) = 1$, and the first few Legendre polynomials 
are illustrated below. 











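[Figure: plots of the first few Legendre polynomials, including $L_2(x) = \frac{1}{2}(3x^2 - 1)$ and $L_3(x) = \frac{1}{2}(5x^3 - 3x)$.] 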








As you can see, when the order of the Legendre polynomial increases, the curve 
gets more complex. Legendre polynomials are orthogonal to each other within 
$x \in [-1, 1]$, and any regular polynomial can be written as a linear combination 
of Legendre polynomials, just like it can be written as a linear combination of 
powers of x. 

Polynomial models are a special case of linear models in a space $\mathcal{Z}$, under 
a nonlinear transformation $\Phi: \mathcal{X} \to \mathcal{Z}$. Here, for the Qth order polynomial 
model, $\Phi$ transforms x into a vector $\mathbf{z}$ of Legendre polynomials, 

$$\mathbf{z} = \begin{bmatrix} 1 \\ L_1(x) \\ \vdots \\ L_Q(x) \end{bmatrix}.$$

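Our hypothesis set $\mathcal{H}_Q$ is a linear combination of these polynomials, 

$$\mathcal{H}_Q = \left\{ h \;\middle|\; h(x) = \mathbf{w}^{T}\mathbf{z} = \sum_{q=0}^{Q} w_q L_q(x),\ \mathbf{w} \in \mathbb{R}^{Q+1} \right\},$$

where $L_0(x) = 1$. As usual, we will sometimes refer to the hypothesis h by its 
weight vector $\mathbf{w}$.² Since each h is linear in $\mathbf{w}$, we can use the machinery of 
linear regression from Chapter 3 to minimize the squared error 

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{w}^{T}\mathbf{z}_n - y_n)^2.$$   (4.2) 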


The case of polynomial regression with squared-error measure illustrates the 
main ideas of regularization well, and facilitates a solid mathematical derivation. 
Nonetheless, our discussion will generalize in practice to non-linear, 
multi-dimensional settings with more general error measures. The baseline 
algorithm (without regularization) is to minimize $E_{\text{in}}$ over the hypotheses in $\mathcal{H}_Q$ 
to produce the final hypothesis $g(x) = \mathbf{w}_{\text{lin}}^{T}\mathbf{z}$, where $\mathbf{w}_{\text{lin}} = \arg\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w})$. 


Exercise 4.4 
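Let $\mathrm{Z} = [\mathbf{z}_1 \ \ldots \ \mathbf{z}_N]^{T}$ be the data matrix (assume Z has full column 
rank); let $\mathbf{w}_{\text{lin}} = (\mathrm{Z}^{T}\mathrm{Z})^{-1}\mathrm{Z}^{T}\mathbf{y}$; and let $\mathrm{H} = \mathrm{Z}(\mathrm{Z}^{T}\mathrm{Z})^{-1}\mathrm{Z}^{T}$ (the hat matrix 
of Exercise 3.3). Show that 

$$E_{\text{in}}(\mathbf{w}) = \frac{(\mathbf{w} - \mathbf{w}_{\text{lin}})^{T}\mathrm{Z}^{T}\mathrm{Z}(\mathbf{w} - \mathbf{w}_{\text{lin}}) + \mathbf{y}^{T}(\mathrm{I} - \mathrm{H})\mathbf{y}}{N},$$   (4.3) 

where I is the identity matrix. 

(a) What value of $\mathbf{w}$ minimizes $E_{\text{in}}$? 

(b) What is the minimum in-sample error? 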
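
Before proving (4.3), it can be reassuring to check it numerically. The sketch below is only a sanity check on hypothetical data (a noisy sinusoid, our choice), not a substitute for the derivation the exercise asks for.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)
N, Q = 20, 4
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(N)   # hypothetical noisy data

Z = legendre.legvander(x, Q)                  # N x (Q+1) data matrix
w_lin = np.linalg.solve(Z.T @ Z, Z.T @ y)     # (Z^T Z)^{-1} Z^T y
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)         # hat matrix Z (Z^T Z)^{-1} Z^T

def E_in(w):
    # Direct definition (4.2): mean squared error on the data.
    return np.mean((Z @ w - y) ** 2)

def E_in_decomposed(w):
    # Right-hand side of (4.3).
    d = w - w_lin
    return (d @ Z.T @ Z @ d + y @ (np.eye(N) - H) @ y) / N

w = rng.standard_normal(Q + 1)                # an arbitrary weight vector
print(E_in(w), E_in_decomposed(w))
```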


² We used $\tilde{\mathbf{w}}$ and $\tilde{d}$ for the weight vector and dimension in $\mathcal{Z}$. Since we are explicitly 
dealing with polynomials and $\mathcal{Z}$ is the only space around, we use $\mathbf{w}$ and Q for simplicity. 

The task of regularization, which results in a final hypothesis $\mathbf{w}_{\text{reg}}$ instead of 
the simple $\mathbf{w}_{\text{lin}}$, is to constrain the learning so as to prevent overfitting the 
data. We have already seen an example of constraining the learning; the set $\mathcal{H}_2$ 
can be thought of as a constrained version of $\mathcal{H}_{10}$ in the sense that some of 
the $\mathcal{H}_{10}$ weights are required to be zero. That is, $\mathcal{H}_2$ is a subset of $\mathcal{H}_{10}$ defined 
by $\mathcal{H}_2 = \{\mathbf{w} \mid \mathbf{w} \in \mathcal{H}_{10};\ w_q = 0 \text{ for } q \ge 3\}$. Requiring some weights to be 0 is 
a hard constraint. We have seen that such a hard constraint on the order can 
help, for example $\mathcal{H}_2$ is better than $\mathcal{H}_{10}$ when there is a lot of noise and N is 
small. Instead of requiring some weights to be zero, we can force the weights 
to be small but not necessarily zero through a softer constraint such as 


$$\sum_{q=0}^{Q} w_q^2 \le C.$$


This is a ‘soft order’ constraint because it only encourages each weight to be 
small, without changing the order of the polynomial by explicitly setting some 
weights to zero. The in-sample optimization problem becomes: 


$$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^{T}\mathbf{w} \le C.$$   (4.4) 


The data determines the optimal weight sizes, given the total budget C which 
determines the amount of regularization; the larger C is, the weaker the con- 
straint and the smaller the amount of regularization. We can define the soft- 
order-constrained hypothesis set H(C) by 


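$$\mathcal{H}(C) = \{h \mid h(x) = \mathbf{w}^{T}\mathbf{z},\ \mathbf{w}^{T}\mathbf{w} \le C\}.$$

Equation (4.4) is equivalent to minimizing $E_{\text{in}}$ over $\mathcal{H}(C)$. If $C_1 < C_2$, then 
$\mathcal{H}(C_1) \subset \mathcal{H}(C_2)$ and so $d_{\text{vc}}(\mathcal{H}(C_1)) \le d_{\text{vc}}(\mathcal{H}(C_2))$, and we expect better 
generalization with $\mathcal{H}(C_1)$. Let the regularized weights $\mathbf{w}_{\text{reg}}$ be the solution 
to (4.4). 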


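Solving for $\mathbf{w}_{\text{reg}}$. If $\mathbf{w}_{\text{lin}}^{T}\mathbf{w}_{\text{lin}} \le C$ then $\mathbf{w}_{\text{reg}} = \mathbf{w}_{\text{lin}}$ because 
$\mathbf{w}_{\text{lin}} \in \mathcal{H}(C)$. If $\mathbf{w}_{\text{lin}} \notin \mathcal{H}(C)$, then not only is $\mathbf{w}_{\text{reg}}^{T}\mathbf{w}_{\text{reg}} \le C$, but in fact 
$\mathbf{w}_{\text{reg}}^{T}\mathbf{w}_{\text{reg}} = C$ ($\mathbf{w}_{\text{reg}}$ uses the entire budget C; see Problem 4.10). 
We thus need to minimize $E_{\text{in}}$ subject to the equality constraint $\mathbf{w}^{T}\mathbf{w} = C$. 
The situation is illustrated in the accompanying figure, whose labels are 
$E_{\text{in}} = \text{const.}$, $\mathbf{w}^{T}\mathbf{w} = C$, $\mathbf{w}_{\text{lin}}$, and the normal vector. The weights $\mathbf{w}$ must lie 
on the surface of the sphere $\mathbf{w}^{T}\mathbf{w} = C$; the normal vector to this surface at $\mathbf{w}$ 
is the vector $\mathbf{w}$ itself (also in red). A surface of constant $E_{\text{in}}$ is shown in 
blue; this surface is a quadratic surface (see Exercise 4.4) and the normal to 
this surface is $\nabla E_{\text{in}}(\mathbf{w})$. In this case, $\mathbf{w}$ cannot be optimal because $\nabla E_{\text{in}}(\mathbf{w})$ is 
not parallel to the red normal vector. This means that $\nabla E_{\text{in}}(\mathbf{w})$ has some 
non-zero component along the constraint surface, and by moving a small amount 
in the opposite direction of this component we can improve $E_{\text{in}}$, while still 
remaining on the surface. If $\mathbf{w}_{\text{reg}}$ is to be optimal, then for some positive 
parameter $\lambda_C$ 

$$\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) = -2\lambda_C \mathbf{w}_{\text{reg}},$$

i.e., $\nabla E_{\text{in}}$ must be parallel to $\mathbf{w}_{\text{reg}}$, the normal vector to the constraint surface 
(the scaling by 2 is for mathematical convenience and the negative sign is 
because $\nabla E_{\text{in}}$ and $\mathbf{w}$ are in opposite directions). Equivalently, $\mathbf{w}_{\text{reg}}$ satisfies 

$$\nabla\left(E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^{T}\mathbf{w}\right)\Big|_{\mathbf{w} = \mathbf{w}_{\text{reg}}} = \mathbf{0},$$

because $\nabla(\mathbf{w}^{T}\mathbf{w}) = 2\mathbf{w}$. So, for some $\lambda_C > 0$, $\mathbf{w}_{\text{reg}}$ locally minimizes 

$$E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^{T}\mathbf{w}.$$   (4.5) 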


The parameter $\lambda_C$ and the vector $\mathbf{w}_{\text{reg}}$ (both of which depend on C and the 
data) must be chosen so as to simultaneously satisfy the gradient equality and 
the weight norm constraint $\mathbf{w}_{\text{reg}}^{T}\mathbf{w}_{\text{reg}} = C$.³ That $\lambda_C > 0$ is intuitive since 
we are enforcing smaller weights, and minimizing $E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^{T}\mathbf{w}$ would 
not lead to smaller weights if $\lambda_C$ were negative. Note that if $\mathbf{w}_{\text{lin}}^{T}\mathbf{w}_{\text{lin}} \le C$, 
$\mathbf{w}_{\text{reg}} = \mathbf{w}_{\text{lin}}$ and minimizing (4.5) still holds with $\lambda_C = 0$. Therefore, we 
have an equivalence between solving the constrained problem (4.4) and the 
unconstrained minimization of (4.5). This equivalence means that minimizing 
(4.5) is similar to minimizing $E_{\text{in}}$ using a smaller hypothesis set, which in 
turn means that we can expect better generalization by minimizing (4.5) than 
by just minimizing $E_{\text{in}}$. 
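
The equivalence can be seen numerically. The sketch below solves the constrained problem (4.4) for a linear model by searching for the $\lambda_C$ whose unconstrained minimizer of (4.5) lands exactly on the sphere. The data are hypothetical, the bisection search is our choice of method, and the factor N appears because $E_{\text{in}}$ is an average over N points.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 6
Z = rng.standard_normal((N, d))               # hypothetical data matrix
y = Z @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)

def w_of(lam_C):
    # Minimizer of E_in(w) + lam_C * w^T w, where E_in(w) = ||Zw - y||^2 / N.
    return np.linalg.solve(Z.T @ Z + N * lam_C * np.eye(d), Z.T @ y)

def solve_constrained(C, tol=1e-12):
    w_lin = w_of(0.0)
    if w_lin @ w_lin <= C:
        return w_lin, 0.0                     # constraint inactive, lam_C = 0
    lo, hi = 0.0, 1.0
    while w_of(hi) @ w_of(hi) > C:            # grow hi until the norm fits the budget
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):       # bisection: the norm shrinks as lam_C grows
        mid = 0.5 * (lo + hi)
        if w_of(mid) @ w_of(mid) > C:
            lo = mid
        else:
            hi = mid
    return w_of(hi), hi

C = 0.5 * (w_of(0.0) @ w_of(0.0))             # a budget that w_lin exceeds
w_reg, lam_C = solve_constrained(C)
grad = 2.0 / N * Z.T @ (Z @ w_reg - y)        # gradient of E_in at w_reg
print(lam_C, w_reg @ w_reg, C)
```

At the solution, `grad` equals `-2 * lam_C * w_reg` and the weights use the entire budget C, which are the two conditions stated above.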

Other variations of the constraint in (4.4) can be used to emphasize some 
weights over the others. Consider the constraint $\sum_{q=0}^{Q} \gamma_q w_q^2 \le C$. The 
importance $\gamma_q$ given to weight $w_q$ determines the type of regularization. For 
example, $\gamma_q = q$ or $\gamma_q = e^{q}$ encourages a low-order fit, and $\gamma_q = (1 + q)^{-1}$ or 
$\gamma_q = e^{-q}$ encourages a high-order fit. In extreme cases, one recovers hard-order 
constraints by choosing some $\gamma_q = 0$ and some $\gamma_q \to \infty$. 


Exercise 4.5 [Tikhonov regularizer] 
A more general soft constraint is the Tikhonov regularization constraint 

$$\mathbf{w}^{T}\Gamma^{T}\Gamma\mathbf{w} \le C$$

which can capture relationships among the $w_i$ (the matrix $\Gamma$ is the Tikhonov 
regularizer). 

(a) What should $\Gamma$ be to obtain the constraint $\sum_{q=0}^{Q} w_q^2 \le C$? 

(b) What should $\Gamma$ be to obtain the constraint $\left(\sum_{q=0}^{Q} w_q\right)^2 \le C$? 





³ $\lambda_C$ is known as a Lagrange multiplier and an alternate derivation of these same results 
can be obtained via the theory of Lagrange multipliers for constrained optimization. 


4.2.2 Weight Decay and Augmented Error 


The soft-order constraint for a given value of C is a constrained minimization 
of $E_{\text{in}}$. Equation (4.5) suggests that we may equivalently solve an 
unconstrained minimization of a different function. Let’s define the augmented 
error, 

$$E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \lambda\mathbf{w}^{T}\mathbf{w},$$   (4.6) 


where $\lambda \ge 0$ is now a free parameter at our disposal. The augmented error has 
two terms. The first is the in-sample error which we are used to minimizing, 
and the second is a penalty term. Notice that this fits the heuristic view of 
regularization that we discussed earlier, where the penalty for complexity is 
defined for each individual h instead of $\mathcal{H}$ as a whole. When $\lambda = 0$, we have the 
usual in-sample error. For $\lambda > 0$, minimizing the augmented error corresponds 
to minimizing a penalized in-sample error. The value of $\lambda$ controls the amount 
of regularization. The penalty term $\mathbf{w}^{T}\mathbf{w}$ enforces a tradeoff between making 
the in-sample error small and making the weights small, and has become known 
as weight decay. As discussed in Problem 4.8, if we minimize the augmented 
error using an iterative method like gradient descent, we will have a reduction 
of the in-sample error together with a gradual shrinking of the weights, hence 
the name weight ‘decay.’ In the statistics community, this type of penalty 
term is a form of ridge regression. 
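
The ‘decay’ is easy to see in code. Since $\nabla(\lambda\mathbf{w}^{T}\mathbf{w}) = 2\lambda\mathbf{w}$, a gradient descent step on (4.6) first shrinks the weights by the factor $(1 - 2\eta\lambda)$ and then takes the usual step on $E_{\text{in}}$. The sketch below uses hypothetical data and a learning rate `eta` and step count chosen by us; it is an illustration, not the solution to Problem 4.8.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 40, 5
Z = rng.standard_normal((N, d))               # hypothetical data matrix
y = Z @ rng.standard_normal(d) + 0.3 * rng.standard_normal(N)

def grad_E_in(w):
    # Gradient of E_in(w) = ||Zw - y||^2 / N.
    return 2.0 / N * Z.T @ (Z @ w - y)

def gradient_descent(lam, eta=0.05, steps=5000):
    w = np.zeros(d)
    for _ in range(steps):
        # Identical to w - eta * grad(E_in + lam * w^T w):
        # shrink the weights, then take the usual step on E_in.
        w = (1.0 - 2.0 * eta * lam) * w - eta * grad_E_in(w)
    return w

lam = 0.1
w_decay = gradient_descent(lam)
w_plain = gradient_descent(0.0)

# Closed-form minimizer of E_in + lam * w^T w, for comparison.
w_closed = np.linalg.solve(Z.T @ Z + N * lam * np.eye(d), Z.T @ y)
print(np.linalg.norm(w_decay), np.linalg.norm(w_plain))
```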

There is an equivalence between the soft order constraint and augmented 
error minimization. In the soft-order constraint, the amount of regularization 
is controlled by the parameter C. From (4.5), there is a particular $\lambda_C$ (depending 
on C and the data $\mathcal{D}$), for which minimizing the augmented error $E_{\text{aug}}(\mathbf{w})$ 
leads to the same final hypothesis $\mathbf{w}_{\text{reg}}$. A larger C allows larger weights and 
is a weaker soft-order constraint; this corresponds to smaller $\lambda$, i.e., less 
emphasis on the penalty term $\mathbf{w}^{T}\mathbf{w}$ in the augmented error. For a particular 
data set, the optimal value $C^*$ leading to minimum out-of-sample error with 
the soft-order constraint corresponds to an optimal value $\lambda^*$ in the augmented 
error minimization. If we can find $\lambda^*$, we can get the minimum $E_{\text{out}}$. 

Have we gained from the augmented error view? Yes, because augmented 
error minimization is unconstrained, which is generally easier than constrained 
minimization. For example, we can obtain a closed form solution for linear 
models or use a method like stochastic gradient descent to carry out the mini- 
mization. However, augmented error minimization is not so easy to interpret. 
There are no values for the weights which are explicitly forbidden, as there 
are in the soft-order constraint. For a given C, the soft-order constraint cor- 
responds to selecting a hypothesis from the smaller set H(C), and so from 
our VC analysis we should expect better generalization when C decreases ($\lambda$ 
increases). It is through the relationship between $\lambda$ and C that one has a 
theoretical justification of weight decay as a method for regularization. 

We focused on the soft-order constraint $\mathbf{w}^{T}\mathbf{w} \le C$ with corresponding 
augmented error $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \lambda\mathbf{w}^{T}\mathbf{w}$. However, our discussion applies 
more generally. There is a duality between the minimization of the in-sample 
error over a constrained hypothesis set and the unconstrained minimization of 
an augmented error. We may choose to live in either world, but more often 
than not, the unconstrained minimization of the augmented error is more 
convenient. 

In our definition of $E_{\text{aug}}(\mathbf{w})$ in Equation (4.6), we only highlighted the 
dependence on $\mathbf{w}$. There are two other quantities under our control, namely 
the amount of regularization, $\lambda$, and the nature of the regularizer which we 
chose to be $\mathbf{w}^{T}\mathbf{w}$. In general, the augmented error for a hypothesis $h \in \mathcal{H}$ is 

$$E_{\text{aug}}(h, \lambda, \Omega) = E_{\text{in}}(h) + \frac{\lambda}{N}\Omega(h).$$   (4.7) 


For weight decay, $\Omega(h) = \mathbf{w}^{T}\mathbf{w}$, which penalizes large weights. The penalty 
term has two components: the regularizer $\Omega(h)$ (the type of regularization) 
which penalizes a particular property of h; and the regularization parameter $\lambda$ 
(the amount of regularization). The need for regularization goes down as the 
number of data points goes up, so we factored out $\frac{1}{N}$; this allows the optimal 
choice for $\lambda$ to be less sensitive to N. This is just a redefinition of the $\lambda$ that 
we have been using, in order to make it a more stable parameter that is easier 
to interpret. Notice how Equation (4.7) resembles the VC bound (4.1) as we 
anticipated in the heuristic view of regularization. This is why we use the same 
notation $\Omega$ for both the penalty on individual hypotheses $\Omega(h)$ and the penalty 
on the whole set $\Omega(\mathcal{H})$. The correspondence between the complexity of $\mathcal{H}$ and 
the complexity of an individual h will be discussed further in Section 5.1. 
The regularizer $\Omega$ is typically fixed ahead of time, before seeing the data; 
sometimes the problem itself can dictate an appropriate regularizer. 


Exercise 4.6 


We have seen both the hard-order constraint and the soft-order constraint. 
Which do you expect to be more useful for binary classification using the 
perceptron model? [Hint: $\text{sign}(\mathbf{w}^{T}\mathbf{x}) = \text{sign}(\alpha\mathbf{w}^{T}\mathbf{x})$ for any $\alpha > 0$.] 


The optimal regularization parameter, however, typically depends on the data. 
The choice of the optimal $\lambda$ is one of the applications of validation, which we 
will discuss shortly. 


Example 4.2. Linear models with weight decay. Linear models are 
important enough that it is worthwhile to spell out the details of augmented 
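error minimization in this case. From Exercise 4.4, the augmented error is 

$$E_{\text{aug}}(\mathbf{w}) = \frac{(\mathbf{w} - \mathbf{w}_{\text{lin}})^{T}\mathrm{Z}^{T}\mathrm{Z}(\mathbf{w} - \mathbf{w}_{\text{lin}}) + \lambda\mathbf{w}^{T}\mathbf{w} + \mathbf{y}^{T}(\mathrm{I} - \mathrm{H})\mathbf{y}}{N},$$

where Z is the transformed data matrix and $\mathbf{w}_{\text{lin}} = (\mathrm{Z}^{T}\mathrm{Z})^{-1}\mathrm{Z}^{T}\mathbf{y}$. The reader 
may verify, after taking the derivatives of $E_{\text{aug}}$ and setting $\nabla_{\mathbf{w}}E_{\text{aug}} = \mathbf{0}$, that 

$$\mathbf{w}_{\text{reg}} = (\mathrm{Z}^{T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{T}\mathbf{y}.$$

As expected, $\mathbf{w}_{\text{reg}}$ will go to zero as $\lambda \to \infty$, due to the $\lambda\mathrm{I}$ term. The 
predictions on the in-sample data are given by $\hat{\mathbf{y}} = \mathrm{Z}\mathbf{w}_{\text{reg}} = \mathrm{H}(\lambda)\mathbf{y}$, where 

$$\mathrm{H}(\lambda) = \mathrm{Z}(\mathrm{Z}^{T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{T}.$$

The matrix $\mathrm{H}(\lambda)$ plays an important role in defining the effective complexity 
of a model. When $\lambda = 0$, H is the hat matrix of Exercises 3.3 and 4.4, which 
satisfies $\mathrm{H}^2 = \mathrm{H}$ and $\text{trace}(\mathrm{H}) = d + 1$. The vector of in-sample errors, which 
are also called residuals, is $\mathbf{y} - \hat{\mathbf{y}} = (\mathrm{I} - \mathrm{H}(\lambda))\mathbf{y}$, and the in-sample error $E_{\text{in}}$ 
is $E_{\text{in}}(\mathbf{w}_{\text{reg}}) = \frac{1}{N}\mathbf{y}^{T}(\mathrm{I} - \mathrm{H}(\lambda))^2\mathbf{y}$. □ 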
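
The formulas of this example translate directly into a few lines of code. The sketch below assumes hypothetical data (a noisy sinusoid with Legendre features of order Q = 6, our choice) and simply evaluates $\mathbf{w}_{\text{reg}}$, $\mathrm{H}(\lambda)$ and the in-sample error for a few values of $\lambda$.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(4)
N, Q = 30, 6
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(N)   # hypothetical noisy data
Z = legendre.legvander(x, Q)                            # transformed data matrix

def w_reg(lam):
    # (Z^T Z + lam I)^{-1} Z^T y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T @ y)

def hat(lam):
    # H(lam) = Z (Z^T Z + lam I)^{-1} Z^T
    return Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T)

def E_in(lam):
    r = (np.eye(N) - hat(lam)) @ y            # residuals y - y_hat
    return r @ r / N

lams = [0.0, 1e-4, 1e-2, 1.0]
for lam in lams:
    print(lam, np.linalg.norm(w_reg(lam)), E_in(lam))
```

As $\lambda$ grows, the norm of the weights shrinks while the in-sample error rises, which is the tradeoff the penalty term enforces.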


We can now apply weight decay regularization to the first overfitting example 
that opened this chapter. The results for different $\lambda$’s are shown in Figure 4.5. 

[Figure panels: $\lambda = 0.0001$, $\lambda = 0.01$, $\lambda = 1$.] 

Figure 4.5: Weight decay applied to Example 4.2 with different values for 
the regularization parameter $\lambda$. The red fit gets flatter as we increase $\lambda$. 


As you can see, even very little regularization goes a long way, but too much 
regularization results in an overly flat curve at the expense of in-sample fit. 
Another case we saw earlier is Example 4.1, where we fit a linear model to a 
sinusoid. The regularization used there was also weight decay, with $\lambda = 0.1$. 


4.2.3 Choosing a Regularizer: Pill or Poison? 


We have presented a number of ways to constrain a model: hard-order con- 
straints where we simply use a lower-order model, soft-order constraints where 
we constrain the parameters of the model, and augmented error where we add 
a penalty term to an otherwise unconstrained minimization of error. Augmented 
error is the most popular form of regularization, for which we need to 
choose the regularizer $\Omega(h)$ and the regularization parameter $\lambda$. 

In practice, the choice of $\Omega$ is largely heuristic. Finding a perfect $\Omega$ is as 
difficult as finding a perfect $\mathcal{H}$. It depends on information that, by the very 
nature of learning, we don’t have. However, there are regularizers we can work 
with that have stood the test of time, such as weight decay. Some forms of 
regularization work and some do not, depending on the specific application 
and the data. Figure 4.5 illustrated that even the amount of regularization 
has to be chosen carefully. Too much regularization (too harsh a constraint) 
leaves the learning too little flexibility to fit the data and leads to underfitting, 
which can be just as bad as overfitting. 

Figure 4.6: Out-of-sample performance for the uniform and low-order 
regularizers using model $\mathcal{H}_{15}$, with $\sigma^2 = 0.5$, $Q_f = 15$ and $N = 30$. Overfitting 
occurs in the shaded region because lower $E_{\text{in}}$ (lower $\lambda$) leads to higher $E_{\text{out}}$. 
Underfitting occurs when $\lambda$ is too large, because the learning algorithm has 
too little flexibility to fit the data. 

If so many choices can go wrong, why do we bother with regularization 
in the first place? Regularization is a necessary evil, with the operative word 
being necessary. If our model is too sophisticated for the amount of data 
we have, we are doomed. By applying regularization, we have a chance. By 
applying the proper regularization, we are in good shape. Let us experiment 
with two choices of a regularizer for the model $\mathcal{H}_{15}$ of 15th order polynomials, 
using the experimental design in Exercise 4.2: 

1. A uniform regularizer: $\Omega_{\text{unif}}(\mathbf{w}) = \sum_{q=0}^{15} w_q^2$. 

2. A low-order regularizer: $\Omega_{\text{low}}(\mathbf{w}) = \sum_{q=0}^{15} q\,w_q^2$. 


The first encourages all weights to be small, uniformly; the second pays more 
attention to the higher order weights, encouraging a lower order fit. Figure 4.6 
shows the performance for different values of the regularization parameter $\lambda$. 
As you decrease $\lambda$, the optimization pays less attention to the penalty term and 
more to $E_{\text{in}}$, and so $E_{\text{in}}$ will decrease (Problem 4.7). In the shaded region, $E_{\text{out}}$ 
increases as you decrease $E_{\text{in}}$ (decrease $\lambda$); the regularization parameter is 
too small and there is not enough of a constraint on the learning, leading 
to decreased performance because of overfitting. In the unshaded region, the 
regularization parameter is too large, over-constraining the learning and not 
giving it enough flexibility to fit the data, leading to decreased performance 
because of underfitting. As can be observed from the figure, the price paid for 
overfitting is generally more severe than underfitting. It usually pays to be 
conservative. 

Figure 4.7: Performance of the uniform regularizer at different levels of 
noise. The optimal $\lambda$ is highlighted for each curve. 
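
A rough version of the comparison between the uniform and low-order regularizers can be sketched as follows. This is not the experimental design of Exercise 4.2: the target coefficients `a` are drawn at random without the normalization used there, only one data set is generated, and the grid of $\lambda$ values is our choice. Both regularizers are handled by one routine that minimizes $\|\mathrm{Z}\mathbf{w} - \mathbf{y}\|^2 + \lambda\sum_q \gamma_q w_q^2$, with $\gamma_q = 1$ (uniform) or $\gamma_q = q$ (low-order).

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(5)
Q, N, sigma2 = 15, 30, 0.5
a = rng.standard_normal(Q + 1)                # hypothetical target coefficients (Q_f = 15)
x = rng.uniform(-1.0, 1.0, N)
Z = legendre.legvander(x, Q)
y = Z @ a + np.sqrt(sigma2) * rng.standard_normal(N)

x_test = np.linspace(-1.0, 1.0, 2001)
Z_test = legendre.legvander(x_test, Q)

def fit(lam, gamma):
    # Minimize ||Zw - y||^2 + lam * sum_q gamma_q w_q^2 as one least-squares problem.
    Z_aug = np.vstack([Z, np.sqrt(lam) * np.diag(np.sqrt(gamma))])
    y_aug = np.concatenate([y, np.zeros(Q + 1)])
    w, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)
    return w

def errors(lam, gamma):
    w = fit(lam, gamma)
    e_in = np.mean((Z @ w - y) ** 2)
    e_out = np.mean((Z_test @ w - Z_test @ a) ** 2) + sigma2
    return e_in, e_out

gamma_unif = np.ones(Q + 1)                   # uniform regularizer
gamma_low = np.arange(Q + 1, dtype=float)     # low-order regularizer

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
for name, gamma in [("uniform", gamma_unif), ("low-order", gamma_low)]:
    for lam in lams:
        e_in, e_out = errors(lam, gamma)
        print(f"{name:9s} lam={lam:<6g} E_in={e_in:.3f} E_out={e_out:.3f}")
```

On a single data set the out-of-sample curve is noisy, so averaging over many data sets would be needed before reading off an optimal $\lambda$.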


The optimal regularization parameter for the two cases is quite different 
and the performance can be quite sensitive to the choice of regularization 
parameter. However, the promising message from the figure is that though 
the behaviors are quite different, the performances of the two regularizers are 
comparable (around 0.76), if we choose the right $\lambda$ for each. 

We can also use this experiment to study how performance with regularization 
depends on the noise. In Figure 4.7(a), when $\sigma^2 = 0$, no amount 
of regularization helps (i.e., the optimal regularization parameter is $\lambda = 0$), 
which is not a surprise because there is no stochastic or deterministic noise in 
the data (both target and model are 15th order polynomials). As we add more 
stochastic noise, the overall performance degrades as expected. Note that the 
optimal value for the regularization parameter increases with noise, which is 
also expected based on the earlier discussion that the potential to overfit in- 
creases as the noise increases; hence, constraining the learning more should 
help. Figure 4.7(b) shows what happens when we add deterministic noise, 
keeping the stochastic noise at zero. This is accomplished by increasing $Q_f$ 
(the target complexity), thereby adding deterministic noise, but keeping 
everything else the same. Comparing parts (a) and (b) of Figure 4.7 provides 
another demonstration of how the effects of deterministic and stochastic noise 
are similar. When either is present, it is helpful to regularize, and the more 
noise there is, the larger the amount of regularization you need. 

What happens if you pick the wrong regularizer? To illustrate, we picked a 
regularizer which encourages large weights (weight growth) versus weight decay 
which encourages small weights. As you can see, in this case, weight growth does 
not help the cause of overfitting. If we happened to choose weight growth as our 
regularizer, we would still be OK as long as we have a good way to pick the 
regularization parameter; the optimal regularization parameter in this case is 
$\lambda = 0$, and we are no worse off than not regularizing. 
No regularizer will be ideal for all settings, or even for a specific setting since 
we never have perfect information, but they all tend to work with varying 
success, if the amount of regularization $\lambda$ is set to the correct level. Thus, the 
entire burden rests on picking the right $\lambda$, a task that can be addressed by a 
technique called validation, which is the topic of the next section. 

[Figure: expected $E_{\text{out}}$ versus the regularization parameter $\lambda$, for weight growth and for weight decay.] 

The lesson learned is that some form of regularization is necessary, as learn- 
ing is quite sensitive to stochastic and deterministic noise. The best way to 
constrain the learning is in the ‘direction’ of the target function, and more 
of a constraint is needed when there is more noise. Even though we don’t 
know either the target function or the noise, regularization helps by reducing 
the impact of the noise. Most common models have hypothesis sets which are 
naturally parameterized so that smaller parameters lead to smoother hypothe- 
ses. Thus, a weight decay type of regularizer constrains the learning towards 
smoother hypotheses. This helps, because stochastic noise is ‘high frequency’ 
(non-smooth). Similarly, deterministic noise (the part of the target function 
which cannot be modeled) also tends to be non-smooth. Thus, constraining 
the learning towards smoother hypotheses ‘hurts’ our ability to overfit the 
noise more than it hurts our ability to fit the useful information. These are 
empirical observations, not theoretically justifiable statements. 


Regularization and the VC dimension. Regularization (for example 
soft-order selection by minimizing the augmented error) poses a problem for 
the VC line of reasoning. As $\lambda$ goes up, the learning algorithm changes but 
the hypothesis set does not, so $d_{\text{vc}}$ will not change. We argued that $\lambda \uparrow$ in 
the augmented error corresponds to $C \downarrow$ in the soft-order constrained model. 
So, more regularization corresponds to an effectively smaller model, and we 
expect better generalization for a small increase in $E_{\text{in}}$ even though the VC 
dimension of the model we are actually using with augmented error does not 
change. This suggests a heuristic that works well in practice, which is to use an 
‘effective VC dimension’ instead of the VC dimension. For linear perceptrons, 
the VC dimension equals the number of free parameters $d + 1$, and so an effective 
number of parameters is a good surrogate for the VC dimension in the VC 
bound. The effective number of parameters will go down as $\lambda$ increases, and 
so the effective VC dimension will reflect better generalization with increased 
regularization. Problems 4.13, 4.14, and 4.15 explore the notion of an effective 
number of parameters. 
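
As one illustration, the sketch below tracks $\text{trace}(\mathrm{H}(\lambda))$ as $\lambda$ increases, using the matrix $\mathrm{H}(\lambda)$ of Example 4.2. Taking the trace as the effective number of parameters is an assumption made here for illustration; the problems cited above may define it differently. The data matrix is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 50, 8
Z = rng.standard_normal((N, d))               # hypothetical data matrix

def effective_params(lam):
    # trace(H(lam)) = sum_i s_i^2 / (s_i^2 + lam), with s_i the singular values of Z.
    s = np.linalg.svd(Z, compute_uv=False)
    return np.sum(s**2 / (s**2 + lam))

def trace_hat(lam):
    # Same quantity computed directly from H(lam) = Z (Z^T Z + lam I)^{-1} Z^T.
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)
    return np.trace(H)

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
d_eff = [effective_params(lam) for lam in lams]
print(d_eff)
```

With $\lambda = 0$ the trace equals the number of columns of Z, and it decreases monotonically as $\lambda$ grows.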


4.3 Validation 

So far, we have identified overfitting as a problem, noise (stochastic and 
deterministic) as a cause, and regularization as a cure. In this section, we introduce 
another cure, called validation. One can think of both regularization and 
validation as attempts at minimizing $E_{\text{out}}$ rather than just $E_{\text{in}}$. Of course the 
true $E_{\text{out}}$ is not available to us, so we need an estimate of $E_{\text{out}}$ based on 
information available to us in sample. In some sense, this is the Holy Grail of 
machine learning: to find an in-sample estimate of the out-of-sample error. 
Regularization attempts to minimize $E_{\text{out}}$ by working through the equation 


$$E_{\text{out}}(h) = E_{\text{in}}(h) + \underbrace{\text{overfit penalty}}_{\text{regularization estimates this quantity}},$$

and concocting a heuristic term that emulates the penalty term. Validation, 
on the other hand, cuts to the chase and estimates the out-of-sample error 
directly. 

$$\underbrace{E_{\text{out}}(h)}_{\text{validation estimates this quantity}} = E_{\text{in}}(h) + \text{overfit penalty}.$$


Estimating the out-of-sample error directly is nothing new to us. In Sec- 
tion 2.2.3, we introduced the idea of a test set, a subset of D that is not 
involved in the learning process and is used to evaluate the final hypothesis. 
The test error $E_{\text{test}}$, unlike the in-sample error $E_{\text{in}}$, is an unbiased estimate 
of $E_{\text{out}}$. 


4.3.1 The Validation Set 


The idea of a validation set is almost identical to that of a test set. We 
remove a subset from the data; this subset is not used in training. We then 
use this held-out subset to estimate the out-of-sample error. The held-out set 
is effectively out-of-sample, because it has not been used during the learning. 

However, there is a difference between a validation set and a test set. 
Although the validation set will not be directly used for training, it will be 
used in making certain choices in the learning process. The minute a set affects 
the learning process in any way, it is no longer a test set. However, as we will 
see, the way the validation set is used in the learning process is so benign that 
its estimate of $E_{\text{out}}$ remains almost intact. 

Let us first look at how the validation set is created. The first step is 
to partition the data set D into a training set Drain of size (N — K) and a 
validation set Dya of size K. Any partitioning method which does not depend 
on the values of the data points will do; for example, we can select N — K 
points at random for training and the remaining for validation. 

Now, we run the learning algorithm using the training set Dtrain to obtain
a final hypothesis g⁻ ∈ H, where the ‘minus’ superscript indicates that some
data points were taken out of the training. We then compute the validation
error for g⁻ using the validation set Dval:

$$
E_{\text{val}}(g^-) = \frac{1}{K} \sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} e\left(g^-(\mathbf{x}_n), y_n\right),
$$

where e(g⁻(x), y) is the pointwise error measure which we introduced in
Section 1.4.1. For classification, e(g⁻(x), y) = ⟦g⁻(x) ≠ y⟧ and for regression
using squared error, e(g⁻(x), y) = (g⁻(x) − y)².
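As a minimal sketch of the procedure just described (not code from the text), the following Python splits a data set at random into Dtrain and Dval, fits a least-squares linear model on Dtrain only, and evaluates the squared-error Eval on Dval. The function names, the synthetic data, and the choice of linear regression as the learning algorithm are all assumptions made for illustration.

```python
import numpy as np

def split_train_val(X, y, K, rng):
    """Randomly partition (X, y) into N-K training points and K validation points."""
    perm = rng.permutation(len(y))          # the split ignores the data values
    val_idx, train_idx = perm[:K], perm[K:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

def fit_linear(X, y):
    """Least-squares fit with a bias term; returns the weight vector."""
    Z = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w

def validation_error(w, X_val, y_val):
    """E_val(g^-): average squared error of g^- on the K held-out points."""
    Z = np.column_stack([np.ones(len(X_val)), X_val])
    return float(np.mean((Z @ w - y_val) ** 2))

rng = np.random.default_rng(0)
N, K = 100, 20
X = rng.uniform(-1, 1, size=(N, 1))
y = 2.0 * X[:, 0] + 0.5 + rng.normal(0, 0.3, size=N)   # hypothetical noisy linear target

X_tr, y_tr, X_val, y_val = split_train_val(X, y, K, rng)
g_minus = fit_linear(X_tr, y_tr)                        # trained on N-K points only
E_val = validation_error(g_minus, X_val, y_val)
```

The validation points never enter `fit_linear`, which is what makes `E_val` an out-of-sample estimate for `g_minus`.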

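The validation error is an unbiased estimate of Eout because the final
hypothesis g⁻ was created independently of the data points in the validation set.
Indeed, taking the expectation of Eval with respect to the data points in Dval,

$$
\begin{aligned}
\mathbb{E}_{\mathcal{D}_{\text{val}}}\left[E_{\text{val}}(g^-)\right]
&= \frac{1}{K} \sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} \mathbb{E}_{\mathcal{D}_{\text{val}}}\left[e\left(g^-(\mathbf{x}_n), y_n\right)\right] \\
&= \frac{1}{K} \sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) \\
&= E_{\text{out}}(g^-).
\end{aligned}
\tag{4.8}
$$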


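The first step uses the linearity of expectation, and the second step follows
because e(g⁻(x_n), y_n) depends only on x_n and so

$$
\mathbb{E}_{\mathcal{D}_{\text{val}}}\left[e\left(g^-(\mathbf{x}_n), y_n\right)\right]
= \mathbb{E}_{\mathbf{x}_n}\left[e\left(g^-(\mathbf{x}_n), y_n\right)\right]
= E_{\text{out}}(g^-).
$$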


How reliable is Eval at estimating Eout? In the case of classification, one can
use the VC bound to predict how good the validation error is as an estimate for
the out-of-sample error. We can view Dval as an ‘in-sample’ data set on which
we computed the error of the single hypothesis g⁻. We can thus apply the
VC bound for a finite model with one hypothesis in it (the Hoeffding bound).
With high probability,

$$
E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + O\left(\frac{1}{\sqrt{K}}\right). \tag{4.9}
$$
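To make the O(1/√K) term concrete, the one-hypothesis Hoeffding step can be written out as follows. The tolerance δ is a symbol introduced here for illustration, and the pointwise error is assumed to lie in [0, 1], as it does for classification.

$$
\mathbb{P}\left[\,\left|E_{\text{val}}(g^-) - E_{\text{out}}(g^-)\right| > \epsilon\,\right] \le 2e^{-2\epsilon^2 K}
\quad\Longrightarrow\quad
E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + \sqrt{\frac{1}{2K}\ln\frac{2}{\delta}}
\quad \text{with probability at least } 1-\delta.
$$

For instance, with K = 100 and δ = 0.05 the additive term is about 0.136.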


While Inequality (4.9) applies to binary target functions, we may use the
variance of Eval as a more generally applicable measure of the reliability. The
next exercise studies how the variance of Eval depends on K (the size of the
validation set), and implies that a similar bound holds for regression. The
conclusion is that the error between Eval(g⁻) and Eout(g⁻) drops as σ(g⁻)/√K,
where σ(g⁻) is bounded by a constant in the case of classification.
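The σ(g⁻)/√K behavior can be checked numerically. The sketch below is an illustrative simulation, not an experiment from the text: it assumes a fixed classifier whose pointwise errors are independent with a hypothetical error probability `p_err`, draws many validation sets of size K, and compares the empirical spread of Eval with an empirically estimated σ(g⁻) divided by √K.

```python
import numpy as np

rng = np.random.default_rng(1)
p_err = 0.2       # assumed P[g^-(x) != y] for one fixed hypothesis g^-
trials = 20000    # independent validation sets drawn for each K

# Estimate the pointwise standard deviation sigma(g^-) from a large fresh sample.
sigma_point = (rng.random(200_000) < p_err).astype(float).std()

def eval_std(K):
    """Empirical standard deviation of E_val over many validation sets of size K."""
    pointwise = (rng.random((trials, K)) < p_err).astype(float)
    return pointwise.mean(axis=1).std()

for K in (10, 40, 160):
    # Columns: K, simulated spread of E_val, predicted sigma(g^-)/sqrt(K)
    print(K, round(eval_std(K), 4), round(sigma_point / np.sqrt(K), 4))
```

Quadrupling K should roughly halve both columns, which is the 1/√K rate quoted above.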


Exercise 4.7 


Fix g⁻ (learned from Dtrain) and define

$$
\sigma^2_{\text{val}} \overset{\text{def}}{=} \operatorname{Var}_{\mathcal{D}_{\text{val}}}\left[E_{\text{val}}(g^-)\right].
$$

We consider how σ²_val depends on K. Let

$$
\sigma^2(g^-) = \operatorname{Var}_{\mathbf{x}}\left[e\left(g^-(\mathbf{x}), y\right)\right]
$$

be the pointwise variance in the out-of-sample error of g⁻.

(a) Show that σ²_val = (1/K) σ²(g⁻).

(b) In a classification problem, where e(g⁻(x), y) = ⟦g⁻(x) ≠ y⟧, express
σ²_val in terms of P[g⁻(x) ≠ y].

(c) Show that for any g⁻ in a classification problem, σ²_val ≤ 1/(4K).

(d) Is there a uniform upper bound for Var[Eval(g⁻)] similar to (c) in
the case of regression with squared error e(g⁻(x), y) = (g⁻(x) − y)²?
[Hint: The squared error is unbounded.]

(e) For regression with squared error, if we train using fewer points
(smaller N − K) to get g⁻, do you expect σ²(g⁻) to be higher or
lower? [Hint: For continuous, non-negative random variables, higher
mean often implies higher variance.]


(f) Conclude that increasing the size of the validation set can result in a 
better or a worse estimate of Eout. 


The expected validation error for H_2 is illustrated in Figure 4.8, where we
used the experimental design in Exercise 4.2, with Q_f = 10, N = 40 and noise
level 0.4. The expected validation error equals Eout(g⁻), per Equation (4.8).


[Figure 4.8 plot: expected Eval (vertical axis) against the size of the
validation set, K (horizontal axis, ticks at 10, 20, 30).]

Figure 4.8: The expected validation error E[Eval(g⁻)] as a function of K;
the shaded area is E[Eval] ± σ_val.


The figure clearly shows that there is a price to be paid for setting aside K
data points to get this unbiased estimate of Eout: when we set aside more
data for validation, there are fewer training data points and so g⁻ becomes
worse; Eout(g⁻), and hence the expected validation error, increases (the blue
curve). As we expect, the uncertainty in Eval as measured by σ_val (size of the
shaded region) is decreasing with K, up to the point where the variance σ²(g⁻)
gets really bad. This point comes when the number of training data points
becomes critically small, as in Exercise 4.7(e). If K is neither too small nor
too large, Eval provides a good estimate of Eout. A rule of thumb in practice
is to set K = N/5 (set aside 20% of the data for validation).

We have established two conflicting demands on K. It has to be big enough
for Eval to be reliable, and it has to be small enough so that the training set
with N − K points is big enough to get a decent g⁻. Inequality (4.9) quantifies
the first demand. The second demand is quantified by the learning curve
discussed in Section 2.3.2 (also the blue curve in Figure 4.8, from right to left),
which shows how the expected out-of-sample error goes down as the number
of training data points goes up. The fact that more training data lead to a
better final hypothesis has been extensively verified empirically, although it is
challenging to prove theoretically.





Restoring D. Although the learning curve suggests that taking out K data
points for validation and using only N − K for training will cost us in terms
of Eout, we do not have to pay that price! The purpose of validation is to
estimate the out-of-sample performance, and Eval happens to be a good
estimate of Eout(g⁻). This does not mean that we have to output g⁻ as our
final hypothesis. The primary goal is to get the best possible hypothesis, so we
should output g, the hypothesis trained on the entire set D. The secondary
goal is to estimate Eout, which is what validation allows us to do. Based on our
discussion of learning curves, Eout(g) ≤ Eout(g⁻), so

$$
E_{\text{out}}(g) \mathrel{\color{gray}{\le}} E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + O\left(\frac{1}{\sqrt{K}}\right). \tag{4.10}
$$

[Figure 4.9 diagram: D (N points) is split into a training part and Dval
(K points); the training part yields g⁻, Dval yields Eval(g⁻), and the full
set D yields g.]

Figure 4.9: Using a validation set to estimate Eout.


The first inequality is subdued because it was not rigorously proved. If we first
train with N − K data points, validate with the remaining K data points and
then retrain using all the data to get g, the validation error we got will likely
still be better at estimating Eout(g) than the estimate using the VC-bound
with Ein(g), especially for large hypothesis sets with big d_vc.

So far, we have treated the validation set as a way to estimate Eout, without
involving it in any decisions that affect the learning process. Estimating Eout
is a useful role by itself: a customer would typically want to know how good
the final hypothesis is (in fact, the inequalities in (4.10) suggest that the
validation error is a pessimistic estimate of Eout, so your customer is likely to
be pleasantly surprised when they try your system on new data). However, as
we will see next, an important role of a validation set is in fact to guide the
learning process. That’s what distinguishes a validation set from a test set.


4.3.2 Model Selection 


By far, the most important use of validation is for model selection. This could
mean the choice between a linear model and a nonlinear model, the choice of
the order of polynomial in a model, the choice of the value of a regularization
parameter, or any other choice that affects the learning process. In almost
every learning situation, there are some choices to be made and we need a
principled way of making these choices.

Figure 4.10: Optimistic bias of the validation error when using a validation
set for the model selected.

The leap is to realize that validation can be used to estimate the
out-of-sample error for more than one model. Suppose we have M models
H_1, ..., H_M. Validation can be used to select one of these models. Use the
training set Dtrain to learn a final hypothesis g⁻_m for each model. Now
evaluate each model on the validation set to obtain the validation errors
E_1, ..., E_M, where

$$
E_m = E_{\text{val}}(g_m^-); \qquad m = 1, \ldots, M.
$$

The validation errors estimate the out-of-sample error Eout(g⁻_m) for each H_m.


Exercise 4.8 


Is E_m an unbiased estimate for the out-of-sample error Eout(g⁻_m)?


It is now a simple matter to select the model with lowest validation error.
Let m* be the index of the model which achieves the minimum validation
error. So for H_m*, E_m* ≤ E_m for m = 1, ..., M. The model H_m* is the model
selected based on the validation errors. Note that E_m* is no longer an unbiased
estimate of Eout(g⁻_m*). Since we selected the model with minimum validation
error, E_m* will have an optimistic bias. This optimistic bias when selecting
between H_2 and H_5 is illustrated in Figure 4.10, using the experimental design
described in Exercise 4.2 with Q_f = 3, σ² = 0.4 and N = 35.


Exercise 4.9 


Referring to Figure 4.10, why are both curves increasing with K? Why do 
they converge to each other with increasing K?


How good is the generalization error for this entire process of model selection
using validation? Consider a new model Hval consisting of the final hypotheses
learned from the training data using each model H_1, ..., H_M:

$$
\mathcal{H}_{\text{val}} = \left\{g_1^-, g_2^-, \ldots, g_M^-\right\}.
$$

Model selection using the validation set chose one of the hypotheses in Hval
based on its performance on Dval. Since the model Hval was obtained before
ever looking at the data in the validation set, this process is entirely equivalent
to learning a hypothesis from Hval using the data in Dval. The validation
errors Eval(g⁻_m) are ‘in-sample’ errors for this learning process and so we may
apply the VC bound for finite hypothesis sets, with |Hval| = M:

$$
E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\left(\sqrt{\frac{\ln M}{K}}\right). \tag{4.11}
$$
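To get a feel for how slowly the penalty in (4.11) grows with M, the sketch below evaluates the additive term of the finite-hypothesis Hoeffding bound, √(ln(2M/δ)/(2K)). The function name and the tolerance δ are assumptions introduced here, and the constants hidden in the O(·) of the text are taken from the standard Hoeffding union bound for errors in [0, 1].

```python
import math

def hoeffding_term(K, M=1, delta=0.05):
    """Additive term sqrt(ln(2M/delta) / (2K)) for M hypotheses and K validation points."""
    return math.sqrt(math.log(2 * M / delta) / (2 * K))

for M in (1, 10, 100):
    print(M, round(hoeffding_term(K=100, M=M), 3))
```

Going from one model to a hundred increases the term only logarithmically, which is why selecting among a modest number of models costs little in reliability.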


What if we didn’t use a validation set to choose the model? One alternative 
would be to use the in-sample errors from each model as the model selection 
criterion. Specifically, pick the model which gives a final hypothesis with min- 
imum in-sample error. This is equivalent to picking the hypothesis with mini- 
mum in-sample error from the grand model which contains all the hypotheses 
in each of the M original models. If we want a bound on the out-of-sample 
error for the final hypothesis that results from this selection, we need to apply 
the VC-penalty for this grand hypothesis set which is the union of the M 
hypothesis sets (see Problem 2.14). Since this grand hypothesis set can have 
a huge VC-dimension, the bound in (4.11) will generally be tighter. 

The goal of model selection is to select the best model and output the best
hypothesis from that model. Specifically, we want to select the model m for
which Eout(g_m) will be minimum when we retrain with all the data. Model
selection using a validation set relies on the leap of faith that if Eout(g⁻_m)
is minimum, then Eout(g_m) is also minimum. The validation errors E_m
estimate Eout(g⁻_m), so modulo our leap of faith, the validation set should pick
the right model. No matter which model m* is selected, however, based on the
discussion of learning curves in the previous section, we should not output
g⁻_m* as the final hypothesis. Rather, once m* is selected using validation,
learn using all the data and output g_m*, which satisfies

$$
E_{\text{out}}(g_{m^*}) \mathrel{\color{gray}{\le}} E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\left(\sqrt{\frac{\ln M}{K}}\right). \tag{4.12}
$$

[Figure 4.11 diagram: each model H_1, H_2, ..., H_M is trained to give
g⁻_1, ..., g⁻_M; Dval gives the errors E_1, ..., E_M; the best pair
(H_m*, E_m*) is picked, and H_m* is retrained on all of D to output g_m*.]

Figure 4.11: Using a validation set for model selection.


Again, the first inequality is subdued because we didn’t prove it.

Figure 4.12: Model selection between H_2 and H_5 using a validation set. The
solid black line uses Ein for model selection, which always selects H_5. The
dotted line shows the optimal model selection, if we could select the model
based on the true out-of-sample error. This is unachievable, but a useful
benchmark. The best performer is clearly the validation set, outputting g_m*.
For suitable K, even g⁻_m* is better than in-sample selection.


Continuing our experiment from Figure 4.10, we evaluate the out-of-sample
performance when using a validation set to select between the models H_2
and H_5. The results are shown in Figure 4.12. Validation is a clear winner
over using Ein for model selection.
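The select-then-retrain recipe can be sketched in a few lines of Python. This is an illustrative sketch, not the experiment behind Figure 4.12: the target function, noise level, and helper names are hypothetical, and ordinary polynomial least squares stands in for the learning algorithm.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs to [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree):
    """Least-squares polynomial fit; returns the weight vector."""
    w, *_ = np.linalg.lstsq(poly_features(x, degree), y, rcond=None)
    return w

def sq_error(w, x, y):
    """Mean squared error of the polynomial with weights w on (x, y)."""
    return float(np.mean((poly_features(x, len(w) - 1) @ w - y) ** 2))

def select_by_validation(x, y, degrees, K, rng):
    """Pick the degree with the lowest validation error, then retrain on all of D."""
    perm = rng.permutation(len(y))
    val, train = perm[:K], perm[K:]
    # E_m = E_val(g_m^-): train on the N-K training points, evaluate on D_val.
    E = [sq_error(fit(x[train], y[train], d), x[val], y[val]) for d in degrees]
    m_star = int(np.argmin(E))
    g_final = fit(x, y, degrees[m_star])      # g_{m*}, trained on all N points
    return degrees[m_star], E, g_final

rng = np.random.default_rng(2)
N = 35
x = rng.uniform(-1, 1, N)
y = 1 - x + 0.5 * x**2 + rng.normal(0, 0.3, N)   # hypothetical quadratic target plus noise
best_degree, E, g_final = select_by_validation(x, y, degrees=[2, 5], K=10, rng=rng)
```

Note that the validation errors in `E` are used only to choose the degree; the hypothesis that is returned is retrained on the full data set.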


Exercise 4.10 


(a) From Figure 4.12, E[Eout(g⁻_m*)] is initially decreasing. How can this
be, if E[Eout(g⁻_m)] is increasing in K for each m?

(b) From Figure 4.12 we see that E[Eout(g_m*)] is initially decreasing, and
then it starts to increase. What are the possible reasons for this?

(c) When K = 1, E[Eout(g⁻_m*)] < E[Eout(g_m*)]. How can this be, if the
learning curves for both models are decreasing?


Example 4.3. We can use a validation set to select the value of the
regularization parameter in the augmented error of (4.6). Although the most
important part of a model is the hypothesis set, every hypothesis set has an
associated learning algorithm which selects the final hypothesis g. Two
models may be different only in their learning algorithm, while working with the
same hypothesis set. Changing the value of λ in the augmented error changes
the learning algorithm (the criterion by which g is selected) and effectively
changes the model.

Based on this discussion, consider the M different models corresponding to
the same hypothesis set H but with M different choices for λ in the augmented
error. So, we have (H, λ_1), (H, λ_2), ..., (H, λ_M) as our M different models. We
may, for example, choose λ_1 = 0, λ_2 = 0.01, λ_3 = 0.02, ..., λ_M = 10. Using a
validation set to choose one of these M models amounts to determining the
value of λ to within a resolution of 0.01. □


We have analyzed validation for model selection based on a finite number of 
models. If validation is used to choose the value of a parameter, for example λ
as in the previous example, then the value of M will depend on the resolution
to which we determine that parameter. In the limit, the selection is actually
among an infinite number of models since the value of λ can be any real
number. What happens to bounds like (4.11) and (4.12) which depend on M? 
Just as the Hoeffding bound for a finite hypothesis set did not collapse when 
we moved to infinite hypothesis sets with finite VC-dimension, bounds like 
(4.11) and (4.12) will not completely collapse either. We can derive VC-type 
bounds here too, because even though there are an infinite number of models, 
these models are all very similar; they differ only slightly in the value of λ. As
a rule of thumb, what matters is the number of parameters we are trying to 
set. If we have only one or a few parameters, the estimates based on a decent- 
sized validation set would be reliable. The more choices we make based on the 
same validation set, the more ‘contaminated’ the validation set becomes and 
the less reliable its estimates will be. The more we use the validation set to 
fine tune the model, the more the validation set becomes like a training set 
used to ‘learn the right model’; and we all know how limited a training set is 
in its ability to estimate Eout.

You will be hard pressed to find a serious learning problem in which valida- 
tion is not used. Validation is a conceptually simple technique, easy to apply 
in almost any setting, and requires no specific knowledge about the details of 
a model. The main drawback is the reduced size of the training set, but that 
can be significantly mitigated through a modified version of validation which 
we discuss next. 


4.3.3 Cross Validation 


Validation relies on the following chain of reasoning,

$$
E_{\text{out}}(g) \underset{(\text{small } K)}{\approx} E_{\text{out}}(g^-) \underset{(\text{large } K)}{\approx} E_{\text{val}}(g^-),
$$

which highlights the dilemma we face in trying to select K. We are going to
output g. When K is large, there is a discrepancy between the two
out-of-sample errors Eout(g⁻) (which Eval directly estimates) and Eout(g) (which is
the final error when we learn using all the data D). We would like to choose K
as small as possible in order to minimize the discrepancy between Eout(g⁻)
and Eout(g); ideally K = 1. However, if we make this choice, we lose the
reliability of the validation estimate as the bound on the RHS of (4.9) becomes
huge. The validation error Eval(g⁻) will still be an unbiased estimate of Eout(g⁻)
(g⁻ is trained on N − 1 points), but it will be so unreliable as to be useless
since it is based on only one data point. This brings us to the cross validation
estimate of out-of-sample error. We will focus on the leave-one-out version
which corresponds to a validation set of size K = 1, and is also the easiest
case to illustrate. More popular versions typically use larger K, but the essence
of the method is the same.

There are N ways to partition the data into a training set of size N − 1
and a validation set of size 1. Specifically, let

$$
\mathcal{D}_n = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), \textcolor{red}{(\mathbf{x}_n, y_n)}, (\mathbf{x}_{n+1}, y_{n+1}), \ldots, (\mathbf{x}_N, y_N)
$$

be the data set D after leaving out data point (x_n, y_n), which has been shaded
in red. Denote the final hypothesis learned from D_n by g⁻_n. Let e_n be the error
made by g⁻_n on its validation set which is just a single data point {(x_n, y_n)}:

$$
e_n = E_{\text{val}}(g_n^-) = e\left(g_n^-(\mathbf{x}_n), y_n\right).
$$

The cross validation estimate is the average value of the e_n’s,

$$
E_{\text{cv}} = \frac{1}{N} \sum_{n=1}^{N} e_n.
$$
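The definition above translates directly into a loop over the N left-out points. The following is a minimal sketch under stated assumptions: squared error as the pointwise measure, a least-squares line as the learning algorithm, and a small hypothetical data set; the helper names are not from the text.

```python
import numpy as np

def loo_cv(x, y, fit, predict):
    """Leave-one-out estimate E_cv = (1/N) * sum_n e_n, using squared error."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n                 # D_n: every point except (x_n, y_n)
        g_n = fit(x[keep], y[keep])              # g_n^- trained on N-1 points
        e[n] = (predict(g_n, x[n]) - y[n]) ** 2  # error on the single held-out point
    return e.mean(), e

# Linear model h(x) = a*x + b, fitted by least squares.
fit_line = lambda x, y: np.polyfit(x, y, 1)
predict_line = lambda coeffs, x: np.polyval(coeffs, x)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.9])    # hypothetical data
E_cv, e = loo_cv(x, y, fit_line, predict_line)
```

Passing `fit` and `predict` as arguments keeps the loop independent of the model, which mirrors the fact that cross validation needs no knowledge of the model's internals.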








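[Figure 4.13 plot: three panels (y against x), each showing a linear fit to two
of the three data points and the error on the left-out point.]

Figure 4.13: Illustration of leave-one-out cross validation for a linear
fit using three data points. The average of the three red errors
obtained by the linear fits leaving out one data point at a time is Ecv.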


Figure 4.13 illustrates cross validation on a simple example. Each e_n is a
wild, yet unbiased estimate for the corresponding Eout(g⁻_n), which follows after
setting K = 1 in (4.8). With cross validation, we have N functions g⁻_1, ..., g⁻_N
together with the N error estimates e_1, ..., e_N. The hope is that these N
errors together would be almost equivalent to estimating Eout on a reliable
validation set of size N, while at the same time we managed to use N − 1
points to obtain each g⁻_n. Let’s try to understand why Ecv is a good estimator
of Eout.

First and foremost, Ecv is an unbiased estimator of ‘Eout(g⁻)’. We have
to be a little careful here because we don’t have a single hypothesis g⁻, as we
did when using a single validation set. Depending on the (x_n, y_n) that was
taken out, each g⁻_n can be a different hypothesis. To understand the sense in
which Ecv estimates Eout, we need to revisit the concept of the learning curve.

Ideally, we would like to know Eout(g). The final hypothesis g is the result
of learning on a random data set D of size N. It is almost as useful to know the
expected performance of your model when you learn on a data set of size N;
the hypothesis g is just one such instance of learning on a data set of size N.
This expected performance averaged over data sets of size N, when viewed
as a function of N, is exactly the learning curve shown in Figure 4.2. More
formally, for a given model, let

$$
\bar{E}_{\text{out}}(N) = \mathbb{E}_{\mathcal{D}}\left[E_{\text{out}}(g)\right]
$$

be the expectation (over data sets D of size N) of the out-of-sample error
produced by the model. The expected value of Ecv is exactly Ēout(N − 1).
This is true because it is true for each individual validation error e_n:

$$
\begin{aligned}
\mathbb{E}_{\mathcal{D}}[e_n]
&= \mathbb{E}_{\mathcal{D}_n} \mathbb{E}_{(\mathbf{x}_n, y_n)}\left[e\left(g_n^-(\mathbf{x}_n), y_n\right)\right] \\
&= \mathbb{E}_{\mathcal{D}_n}\left[E_{\text{out}}(g_n^-)\right] \\
&= \bar{E}_{\text{out}}(N-1).
\end{aligned}
$$

Since this equality holds for each e_n, it also holds for the average. We highlight
this result by making it a theorem.


Theorem 4.4. Ecv is an unbiased estimate of Ēout(N − 1) (the expectation
of the model performance, E[Eout], over data sets of size N − 1).


Now that we have our cross validation estimate of Eout, there is no need to
output any of the g⁻_n as our final hypothesis. We might as well squeeze every
last drop of performance and retrain using the entire data set D, outputting g
as the final hypothesis and getting the benefit of going from N − 1 to N on the
learning curve. In this case, the cross validation estimate will on average be an
upper estimate for the out-of-sample error: Eout(g) ≤ Ecv, so expect to be
pleasantly surprised, albeit slightly.

[Figure 4.14 diagram: D is split into D_1, D_2, ..., D_N, each leaving out one
point (x_n, y_n); each D_n yields a hypothesis g⁻_n and an error e_n; the errors
e_1, e_2, ..., e_N are averaged to give Ecv, and g is trained on all of D.]

Figure 4.14: Using cross validation to estimate Eout.

With just simple validation and a validation set of size K = 1, we know that
the validation estimate will not be reliable. How reliable is the cross
validation estimate Ecv? We can measure the reliability using the variance of Ecv.
Unfortunately, while we were able to pin down the expectation of Ecv, the
variance is not so easy.

If the N cross validation errors e_1, ..., e_N were equivalent to N errors on a
totally separate validation set of size N, then Ecv would indeed be a reliable
estimate, for decent-sized N. The equivalence would hold if the individual e_n’s
were independent of each other. Of course, this is too optimistic. Consider
two validation errors e_n, e_m. The validation error e_n depends on g⁻_n which was
trained on data containing (x_m, y_m). Thus, e_n has a dependency on (x_m, y_m).
The validation error e_m is computed using (x_m, y_m) directly, and so it also
has a dependency on (x_m, y_m). Consequently, there is a possible correlation
between e_n and e_m through the data point (x_m, y_m). That correlation wouldn’t
be there if we were validating a single hypothesis using N fresh (independent)
data points.

How much worse is the cross validation estimate as compared to an
estimate based on a truly independent set of N validation errors? A VC-type
probabilistic bound, or even computation of the asymptotic variance of the
cross validation estimate (Problem 4.23), is challenging. One way to quantify
the reliability of Ecv is to compute how many fresh validation data points
would have a comparable reliability to Ecv, and Problem 4.24 discusses one
way to do this. There are two extremes for this effective size. On the high end
is N, which means that the cross validation errors are essentially independent.
On the low end is 1, which means that Ecv is only as good as any single one
of the individual cross validation errors e_n, i.e., the cross validation errors are
totally dependent. While one cannot prove anything theoretically, in practice
the reliability of Ecv is much closer to the higher end.


[Diagram: a scale running from 1 to N showing the effective number of fresh
examples giving a comparable estimate of Eout; a single e_n sits at 1 and Ecv
sits close to N.]


Cross validation for model selection. In Figure 4.11, the estimates E_m
for the out-of-sample error of model H_m were obtained using the validation set.
Instead, we may use cross validation estimates to obtain E_m: use cross
validation to obtain estimates of the out-of-sample error for each model H_1, ..., H_M,
and select the model with the smallest cross validation error. Now, train this
model selected by cross validation using all the data to output a final
hypothesis, making the usual leap of faith that Eout(g⁻) tracks Eout(g) well.


Example 4.5. In Figure 4.13, we illustrated cross validation for estimating
Eout of a linear model (h(x) = ax + b) using a simple experiment with
three data points generated from a constant target function with noise. We
now consider a second model, the constant model (h(x) = b). We can also
use cross validation to estimate Eout for the constant model, illustrated in
Figure 4.15.

Figure 4.15: Leave-one-out cross validation error for a constant fit.

If we use the in-sample error after fitting all the data (three points), then
the linear model wins because it can use its additional degree of freedom to
fit the data better. The same is true with the cross validation data sets of size
two: the linear model has perfect in-sample error. But, with cross validation,
what matters is the error on the outstanding point in each of these fits. Even
to the naked eye, the average of the cross validation errors is smaller for the
constant model which obtained Ecv = 0.065 versus Ecv = 0.184 for the linear
model. The constant model wins, according to cross validation. The constant
model also has lower Eout and so cross validation selected the correct model
in this example. □
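The comparison in this example is easy to reproduce in spirit. The sketch below uses three hypothetical points (they are not the points behind Figures 4.13 and 4.15, so the numbers differ from 0.065 and 0.184) and computes the leave-one-out estimate for a constant fit and a linear fit with squared error.

```python
import numpy as np

def loo_cv_poly(x, y, degree):
    """Leave-one-out squared-error estimate for a polynomial fit of the given degree."""
    errs = []
    for n in range(len(y)):
        keep = np.arange(len(y)) != n
        coeffs = np.polyfit(x[keep], y[keep], degree)   # fit on the two remaining points
        errs.append((np.polyval(coeffs, x[n]) - y[n]) ** 2)
    return float(np.mean(errs))

# Three hypothetical points from a constant target (value 1) plus noise.
x = np.array([-0.8, 0.1, 0.9])
y = np.array([1.3, 0.8, 1.0])

E_cv_constant = loo_cv_poly(x, y, degree=0)   # h(x) = b
E_cv_linear = loo_cv_poly(x, y, degree=1)     # h(x) = a*x + b
```

For these particular points the constant model has the smaller estimate: each two-point linear fit is exact in-sample but extrapolates poorly to the left-out point.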


One important use of validation is to estimate the optimal regularization
parameter λ, as described in Example 4.3. We can use cross validation for the
same purpose as summarized in the algorithm below.


Cross validation for selecting λ:
1: Define M models by choosing different values for λ in the
   augmented error: (H, λ_1), (H, λ_2), ..., (H, λ_M)
2: for each model m = 1, ..., M do
3:    Use the cross validation module in Figure 4.14 to
      estimate Ecv(m), the cross validation error for model m.
4: Select the model m* with minimum Ecv(m*).
5: Use model (H, λ_m*) and all the data D to obtain the
   final hypothesis g_m*. Effectively, you have estimated the
   optimal λ.
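The steps above can be sketched for linear regression with weight decay. This is an illustrative sketch, not the text's implementation: the polynomial features, the target, the grid of λ values, and the function names are assumptions, and leave-one-out is done by brute force with N rounds of learning per λ.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Weight decay solution w_reg = (Z^T Z + lam*I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def loo_cv_ridge(Z, y, lam):
    """Brute-force leave-one-out estimate: N rounds of learning on N-1 points."""
    N = len(y)
    errs = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        w = ridge_fit(Z[keep], y[keep], lam)
        errs[n] = (Z[n] @ w - y[n]) ** 2
    return errs.mean()

def select_lambda(Z, y, lambdas):
    """Estimate E_cv for each lambda, pick the minimizer, then retrain on all of D."""
    E_cv = np.array([loo_cv_ridge(Z, y, lam) for lam in lambdas])
    m_star = int(np.argmin(E_cv))
    return lambdas[m_star], E_cv, ridge_fit(Z, y, lambdas[m_star])

rng = np.random.default_rng(3)
N = 30
x = rng.uniform(-1, 1, N)
Z = np.vander(x, 9, increasing=True)               # hypothetical 8th-order polynomial features
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)      # hypothetical noisy target
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam, E_cv, w_final = select_lambda(Z, y, lambdas)
```

With M values of λ and N data points this performs MN fits, which is the cost the discussion that follows is concerned with.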








We see from Figure 4.14 that estimating Ecv for just a single model requires N
rounds of learning on D_1, ..., D_N, each of size N − 1. So the cross validation
algorithm above requires MN rounds of learning. This is a formidable task.
If we could analytically obtain Ecv, that would be a big bonus, but analytic
results are often difficult to come by for cross validation. One exception is
in the case of linear models, where we are able to derive an exact analytic
formula for the cross validation estimate.

Analytic computation of Ecv for linear models. Recall that for linear
regression with weight decay, w_reg = (ZᵀZ + λI)⁻¹Zᵀy, and the in-sample
predictions are

$$
\hat{\mathbf{y}} = \mathrm{H}(\lambda)\mathbf{y},
$$

where H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ. Given H, ŷ, and y, it turns out that we can
analytically compute the cross validation estimate as:

$$
E_{\text{cv}} = \frac{1}{N} \sum_{n=1}^{N} \left(\frac{\hat{y}_n - y_n}{1 - \mathrm{H}_{nn}(\lambda)}\right)^2. \tag{4.13}
$$

Notice that the cross validation estimate is very similar to the in-sample error,
Ein = (1/N) Σ_n (ŷ_n − y_n)², differing only by a normalization of each term in the
sum by a factor 1/(1 − H_nn(λ))². One use for this analytic formula is that it
can be directly optimized to obtain the best regularization parameter λ. A
proof of this remarkable formula is given in Problem 4.26.
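Formula (4.13) can be checked numerically against the brute-force leave-one-out loop. The sketch below does so on hypothetical random data; the function names and the value λ = 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def loo_analytic(Z, y, lam):
    """E_cv via (4.13): mean of ((yhat_n - y_n) / (1 - H_nn))^2 from a single fit."""
    d = Z.shape[1]
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)   # hat matrix H(lambda)
    y_hat = H @ y
    return float(np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2))

def loo_brute_force(Z, y, lam):
    """E_cv by retraining N times, each time leaving one point out."""
    N, d = Z.shape
    errs = []
    for n in range(N):
        keep = np.arange(N) != n
        w = np.linalg.solve(Z[keep].T @ Z[keep] + lam * np.eye(d), Z[keep].T @ y[keep])
        errs.append((Z[n] @ w - y[n]) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(4)
Z = np.column_stack([np.ones(25), rng.normal(size=(25, 3))])   # hypothetical features
y = Z @ np.array([0.5, 1.0, -2.0, 0.3]) + rng.normal(0, 0.2, 25)
a, b = loo_analytic(Z, y, lam=0.1), loo_brute_force(Z, y, lam=0.1)
```

The two values agree up to floating-point error, while the analytic version needs one linear solve instead of N.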


Even when we cannot derive such an analytic characterization of cross
validation, the technique generally yields good out-of-sample error estimates
in practice, and so the computational burden is often worth enduring. Also,
as with using a validation set, cross validation applies in almost any setting
without requiring specific knowledge about the details of the models.

So far, we have lived in a world of unlimited computation, and all that
mattered was out-of-sample error; in reality, computation time can be of
consequence, especially with huge data sets. For this reason, leave-one-out cross
validation may not be the method of choice.⁴ A popular derivative of
leave-one-out cross validation is V-fold cross validation.⁵ In V-fold cross validation,
the data are partitioned into V disjoint sets (or folds) D_1, ..., D_V, each of size
approximately N/V; each set D_v in this partition serves as a validation set to
compute a validation error for a hypothesis g⁻ learned on a training set which
is the complement of the validation set, D \ D_v. So, you always validate a
hypothesis on data that was not used for training that particular hypothesis.
The V-fold cross validation error is the average of the V validation errors that
are obtained, one from each validation set D_v. Leave-one-out cross validation
is the same as N-fold cross validation. The gain from choosing V < N is
computational. The drawback is that you will be estimating Eout for a
hypothesis g⁻ trained on less data (as compared with leave-one-out) and so the
discrepancy between Eout(g) and Eout(g⁻) will be larger. A common choice in
practice is 10-fold cross validation, and one of the folds is illustrated below.


[Illustration: D is split into ten folds D_1, D_2, ..., D_10; one fold is used
to validate and the remaining nine are used to train.]
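A minimal sketch of V-fold cross validation follows, assuming squared error and a least-squares line as the learning algorithm; the data and helper names are hypothetical. The folds are formed by a random permutation, so the partition does not depend on the data values.

```python
import numpy as np

def v_fold_cv(x, y, V, fit, predict, rng):
    """V-fold cross validation: average of the V per-fold validation errors (squared error)."""
    folds = np.array_split(rng.permutation(len(y)), V)   # V disjoint folds of size about N/V
    fold_errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False                      # train on D \ D_v
        g = fit(x[train_mask], y[train_mask])
        fold_errors.append(np.mean((predict(g, x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 50)
y = 3 * x + rng.normal(0, 0.5, 50)                       # hypothetical noisy line
fit_line = lambda x, y: np.polyfit(x, y, 1)
predict_line = lambda c, x: np.polyval(c, x)
E_cv_10 = v_fold_cv(x, y, V=10, fit=fit_line, predict=predict_line, rng=rng)
```

Setting `V` equal to the number of data points recovers leave-one-out; smaller `V` means fewer rounds of learning, each on less data.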


⁴ Stability problems have also been reported in leave-one-out.
⁵ Some authors call it K-fold cross validation, but we choose V so as not to confuse with
the size of the validation set K.


4.3.4 Theory Versus Practice


Both validation and cross validation present challenges for the mathematical 
theory of learning, similar to the challenges presented by regularization. The 
theory of generalization, in particular the VC analysis, forms the foundation 
for learnability. It provides us with guidelines under which it is possible to 
make a generalization conclusion with high probability. It is not straightfor- 
ward, and sometimes not possible, to rigorously carry these conclusions over 
to the analysis of validation, cross validation, or regularization. What is pos- 
sible, and indeed quite effective, is to use the theory as a guideline. In the 
case of regularization, constraining the choice of a hypothesis leads to bet- 
ter generalization, as we would intuitively expect, even if the hypothesis set 
remains technically the same. In the case of validation, making a choice for 
few parameters does not overly contaminate the validation estimate of Eout, 
even if the VC guarantee for these estimates is too weak. In the case of cross 
validation, the benefit of averaging several validation errors is observed, even 
if the estimates are not independent. 

Although these techniques are based on sound theoretical foundations,
they are to be considered heuristics because they do not have a full
mathematical justification in the general case. Learning from data is an empirical
task with theoretical underpinnings. We prove what we can prove, but we use 
the theory as a guideline when we don’t have a conclusive proof. In a practical 
application, heuristics may win over a rigorous approach that makes unrealis- 
tic assumptions. The only way to be convinced about what works and what 
doesn’t in a given situation is to try out the techniques and see for yourself. 
The basic message in this chapter can be summarized as follows. 


1. Noise (stochastic or deterministic) affects learning
   adversely, leading to overfitting.

2. Regularization helps to prevent overfitting by
   constraining the model, reducing the impact of the noise,
   while still giving us flexibility to fit the data.

3. Validation and cross validation are useful techniques
   for estimating Eout. One important use of validation
   is model selection, in particular to estimate the
   amount of regularization to use.











Example 4.6. We illustrate validation on the handwritten digit classification
task of deciding whether a digit is 1 or not (see also Example 3.1) based on the
two features which measure the symmetry and average intensity of the digit.
The data is shown in Figure 4.16(a).

[Figure 4.16 panels: (a) Digits classification task; (b) Error curves.]

Figure 4.16: (a) The digits data of which 500 are selected as the training
set. (b) The data are transformed via the 5th order polynomial transform
to a 20-dimensional feature vector. We show the performance curves as we
vary the number of these features used for classification.


We have randomly selected 500 data points as the training data and the 
remaining are used as a test set for evaluation. We considered a nonlinear 
feature transform to a 5th order polynomial feature space: 


$$
(1, x_1, x_2) \to (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, \ldots, x_1^5, x_1^4 x_2, x_1^3 x_2^2, x_1^2 x_2^3, x_1 x_2^4, x_2^5).
$$


Figure 4.16(b) shows the in-sample error as you use more of the transformed 
features, increasing the dimension from 1 to 20. As you add more dimensions 
(increase the complexity of the model), the in-sample error drops, as expected. 
The out-of-sample error drops at first, and then starts to increase, as we hit 
the approximation-generalization tradeoff. The leave-one-out cross validation 
error tracks the behavior of the out-of-sample error quite well. If we were to 
pick a model based on the in-sample error, we would use all 20 dimensions. 
The cross validation error is minimized between 5-7 feature dimensions; we 
take 6 feature dimensions as the model selected by cross validation. The table 
below summarizes the resulting performance metrics: 


                    Ein      Eout
No Validation       0%       2.5%
Cross Validation    0.8%     1.5%


Cross validation results in a performance improvement of about 1%, which is
a massive relative improvement (40% reduction in error rate).
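
As a quick check of the 40% figure, the relative reduction follows directly from the two Eout values reported for this example (2.5% without validation, 1.5% with cross validation):

```latex
\[
\frac{E_{\text{out}}^{\text{no val}} - E_{\text{out}}^{\text{CV}}}{E_{\text{out}}^{\text{no val}}}
= \frac{2.5\% - 1.5\%}{2.5\%} = 0.4 = 40\%.
\]
```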


Exercise 4.11 


In this particular experiment, the black curve (Ecv) is sometimes below and
sometimes above the red curve (Eout). If we repeated this experiment
many times, and plotted the average black and red curves, would you expect
the black curve to lie above or below the red curve?


It is illuminating to see the actual classification boundaries learned with and
without validation. These resulting classifiers, together with the 500 in-sample
data points, are shown in the next figure.


[Figure: two panels plotting Symmetry against Average Intensity.
20-dim classifier (no validation): Ein = 0%, Eout = 2.5%.
6-dim classifier (LOO-CV): Ein = 0.8%, Eout = 1.5%.]


It is clear that the worse out-of-sample performance of the classifier picked 
without validation is due to the overfitting of a few noisy points in the training 
data. While the training data is perfectly separated, the shape of the resulting 
boundary seems highly contorted, which is a symptom of overfitting. Does this 
remind you of the first example that opened the chapter? There, albeit in a 
toy example, we similarly obtained a highly contorted fit. As you can see,
overfitting is real, and here to stay! □


4.4 Problems


Problem 4.1 Plot the monomials of order $i$, $\phi_i(x) = x^i$. As you increase
the order, does this correspond to the intuitive notion of increasing complexity?


Problem 4.2 Consider the feature transform $\mathbf{z} = [L_0(x), L_1(x), L_2(x)]^T$
and the linear model $h(x) = \mathbf{w}^T\mathbf{z}$. For the hypothesis with
$\mathbf{w} = [1, -1, 1]^T$, what is $h(x)$ explicitly as a function of $x$?
What is its degree?


Problem 4.3 The Legendre Polynomials are a family of orthogonal
polynomials which are useful for regression. The first two Legendre Polynomials
are $L_0(x) = 1$, $L_1(x) = x$. The higher order Legendre Polynomials are defined
by the recursion:

\[
L_k(x) = \frac{2k-1}{k}\, x L_{k-1}(x) - \frac{k-1}{k}\, L_{k-2}(x).
\]
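
The recursion translates almost line for line into code. The following is a minimal sketch in plain Python; the function name `legendre_values` is an illustrative choice, not something defined in the text. Each new value uses only the previous two, so computing $L_0(x), \ldots, L_K(x)$ takes time linear in $K$.

```python
def legendre_values(K, x):
    """Return [L_0(x), ..., L_K(x)] using the three-term recursion.

    Each new value costs O(1) work given the previous two, so the
    whole list is computed in time linear in K.
    """
    values = [1.0]                      # L_0(x) = 1
    if K >= 1:
        values.append(float(x))         # L_1(x) = x
    for k in range(2, K + 1):
        # L_k(x) = (2k-1)/k * x * L_{k-1}(x) - (k-1)/k * L_{k-2}(x)
        next_val = ((2 * k - 1) * x * values[k - 1] - (k - 1) * values[k - 2]) / k
        values.append(next_val)
    return values


vals = legendre_values(5, 0.5)          # L_0(0.5), ..., L_5(0.5)
```

Evaluating the function on a grid of $x$ values in $[-1, 1]$ gives the points needed to plot the polynomials.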

(a) What are the first six Legendre Polynomials? Use the recursion to
develop an efficient algorithm to compute $L_0(x), \ldots, L_K(x)$ given $x$.
Your algorithm should run in time linear in $K$. Plot the first six
Legendre polynomials.

(b) Show that $L_k(x)$ is a linear combination of monomials
$x^k, x^{k-2}, \ldots$ (either all odd or all even order, with highest
order $k$). Thus, $L_k(-x) = (-1)^k L_k(x)$.

(c) Show that $\frac{x^2-1}{k} \frac{dL_k(x)}{dx} = x L_k(x) - L_{k-1}(x)$. [Hint: use induction.]

(d) Use part (c) to show that $L_k$ satisfies Legendre's differential equation
\[
\frac{d}{dx}\left( (x^2 - 1) \frac{dL_k(x)}{dx} \right) = k(k+1) L_k(x).
\]
This means that the Legendre Polynomials are eigenfunctions of a Hermitian
linear differential operator and, from Sturm-Liouville theory, they
form an orthogonal basis for continuous functions on $[-1, 1]$.


(e) Use the recurrence to show directly the orthogonality property:
\[
\int_{-1}^{1} dx\, L_k(x) L_\ell(x) =
\begin{cases}
0 & \ell \neq k, \\
\frac{2}{2k+1} & \ell = k.
\end{cases}
\]
[Hint: use induction on $k$, with $\ell \le k$. Use the recurrence for $L_k$ and
consider separately the four cases $\ell = k, k-1, k-2$ and $\ell < k-2$. For
the case $\ell = k$ you will need to compute the integral
$\int_{-1}^{1} dx\, x^2 L_{k-1}^2(x)$. In order to do this, you could use the
differential equation in part (c), multiply by $x L_k$ and then integrate both
sides (the LHS can be integrated by parts). Now solve the resulting equation
for $\int_{-1}^{1} dx\, x^2 L_{k-1}^2(x)$.]


Problem 4.4 LAMi This problem is a detailed version of Exercise 4.2.

We set up an experimental framework which the reader may use to study
various aspects of overfitting. The input space is $\mathcal{X} = [-1, 1]$, with
uniform input probability density, $P(x) = \frac{1}{2}$. We consider the two
models $\mathcal{H}_2$ and $\mathcal{H}_{10}$. The target function is a polynomial
of degree $Q_f$, which we write as $f(x) = \sum_{q=0}^{Q_f} a_q L_q(x)$, where
$L_q(x)$ are the Legendre polynomials. We use
the Legendre polynomials because they are a convenient orthogonal basis for the
polynomials on $[-1, 1]$ (see Section 4.2 and Problem 4.3 for some basic
information on Legendre polynomials). The data set is
$\mathcal{D} = (x_1, y_1), \ldots, (x_N, y_N)$, where $y_n = f(x_n) + \sigma\epsilon_n$
and $\epsilon_n$ are iid standard Normal random variates.


For a single experiment, with specified values for $Q_f, N, \sigma$, generate a random
degree-$Q_f$ target function by selecting coefficients $a_q$ independently from a
standard Normal, rescaling them so that $\mathbb{E}_{a,x}[f^2] = 1$. Generate a data set,
selecting $x_1, \ldots, x_N$ independently from $P(x)$ and $y_n = f(x_n) + \sigma\epsilon_n$.
Let $g_2$ and $g_{10}$ be the best fit hypotheses to the data from $\mathcal{H}_2$ and
$\mathcal{H}_{10}$ respectively, with respective out-of-sample errors
$E_{\text{out}}(g_2)$ and $E_{\text{out}}(g_{10})$.
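
One possible way to organize a single experiment is sketched below with NumPy. Several choices here are assumptions rather than prescriptions from the text: the function name `run_experiment`, the fixed rescaling of the coefficients (one reading of "rescaling so that $\mathbb{E}_{a,x}[f^2] = 1$"), fitting directly in the Legendre basis, and including the noise term $\sigma^2$ in the out-of-sample squared error.

```python
import numpy as np
from numpy.polynomial import legendre as leg


def run_experiment(Qf, N, sigma, rng):
    """One run: returns (E_out(g_2), E_out(g_10)) for a random degree-Qf target."""
    q = np.arange(Qf + 1)
    # For x uniform on [-1, 1], E_x[L_q(x)^2] = 1 / (2q + 1), so
    # E_{a,x}[f^2] = sum_q E[a_q^2] / (2q + 1). Dividing standard Normal
    # coefficients by the square root of that sum makes the expectation 1.
    a = rng.standard_normal(Qf + 1) / np.sqrt(np.sum(1.0 / (2 * q + 1)))

    x = rng.uniform(-1.0, 1.0, size=N)
    y = leg.legval(x, a) + sigma * rng.standard_normal(N)

    errors = []
    for Q in (2, 10):
        Z = leg.legvander(x, Q)                    # columns L_0(x), ..., L_Q(x)
        w, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least-squares fit in H_Q
        # Compare the coefficients of g and f in the Legendre basis.
        D = max(Q, Qf) + 1
        diff = np.zeros(D)
        diff[: Q + 1] += w
        diff[: Qf + 1] -= a
        # Orthogonality gives E_x[(g - f)^2] = sum_q diff_q^2 / (2q + 1);
        # the noise adds sigma^2 to the out-of-sample squared error.
        e_out = np.sum(diff ** 2 / (2 * np.arange(D) + 1)) + sigma ** 2
        errors.append(e_out)
    return errors[0], errors[1]


rng = np.random.default_rng(0)
runs = np.array([run_experiment(Qf=20, N=40, sigma=0.5, rng=rng) for _ in range(200)])
overfit_measure = runs[:, 1].mean() - runs[:, 0].mean()
```

Working in the Legendre basis is a convenience: the orthogonality of the $L_q$ under the uniform input density turns the out-of-sample integral into a weighted sum of squared coefficient differences, so no numerical integration is needed.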


(a) Why do we normalize $f$? [Hint: how would you interpret $\sigma^2$?]


(b) How can we obtain $g_2, g_{10}$? [Hint: pose the problem as linear regression
and use the technology from Chapter 3.]


(c) How can we compute $E_{\text{out}}$ analytically for a given $g_{10}$?


(d) Vary $Q_f, N, \sigma$ and for each combination of parameters, run a large
number of experiments, each time computing $E_{\text{out}}(g_2)$ and
$E_{\text{out}}(g_{10})$. Averaging these out-of-sample errors gives estimates of
the expected out-of-sample error for the given learning scenario
$(Q_f, N, \sigma)$ using $\mathcal{H}_2$ and $\mathcal{H}_{10}$. Let

\[
E_{\text{out}}(\mathcal{H}_2) = \text{average over experiments}(E_{\text{out}}(g_2)),
\]
\[
E_{\text{out}}(\mathcal{H}_{10}) = \text{average over experiments}(E_{\text{out}}(g_{10})).
\]

Define the overfit measure $E_{\text{out}}(\mathcal{H}_{10}) - E_{\text{out}}(\mathcal{H}_2)$.
When is the overfit measure significantly positive (i.e., overfitting is serious)
as opposed to significantly negative? Try the choices
$Q_f \in \{1, 2, \ldots, 100\}$, $N \in \{20, 25, \ldots, 120\}$,
$\sigma^2 \in \{0, 0.05, 0.1, \ldots, 2\}$.

Explain your observations.


(e) Why do we take the average over many experiments? Use the variance 
to select an acceptable number of experiments to average over. 


(f) Repeat this experiment for classification, where the target function is a
noisy perceptron, $f = \text{sign}\left(\sum_{q=1}^{Q_f} a_q L_q(x) + \epsilon\right)$.
Notice that $a_0 = 0$, and the $a_q$'s should be normalized so that
$\mathbb{E}_{a,x}\left[\left(\sum_{q=1}^{Q_f} a_q L_q(x)\right)^2\right] = 1$.
For classification, the models $\mathcal{H}_2, \mathcal{H}_{10}$ contain the sign
of the 2nd and 10th order polynomials respectively. You may use a learning
algorithm for non-separable data from Chapter 3.


Problem 4.5 If $\lambda < 0$ in the augmented error
$E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \lambda\mathbf{w}^T\mathbf{w}$,
what soft order constraint does this correspond to? [Hint: $\lambda < 0$ encourages
large weights.]


Problem 4.6 In the augmented error minimization with $\Gamma = I$ and $\lambda \ge 0$:


(a) Show that $\|\mathbf{w}_{\text{reg}}\| \le \|\mathbf{w}_{\text{lin}}\|$, justifying
the term weight decay. [Hint: start by assuming that
$\|\mathbf{w}_{\text{reg}}\| > \|\mathbf{w}_{\text{lin}}\|$ and derive a contradiction.]
In fact a stronger statement holds: $\|\mathbf{w}_{\text{reg}}\|$ is decreasing in $\lambda$.


(b) Explicitly verify this for linear models. [Hint:
\[
\mathbf{w}_{\text{reg}}^T \mathbf{w}_{\text{reg}} = \mathbf{u}^T (Z^T Z + \lambda I)^{-2} \mathbf{u},
\]
where $\mathbf{u} = Z^T\mathbf{y}$ and $Z$ is the transformed data matrix. Show that
$Z^T Z + \lambda I$ has the same eigenvectors with correspondingly larger eigenvalues as
$Z^T Z$. Expand $\mathbf{u}$ in the eigenbasis of $Z^T Z$. For a matrix $A$, how are the
eigenvectors and eigenvalues of $A^{-2}$ related to those of $A$?]
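
The eigenbasis argument in the hint can be checked numerically. The sketch below uses NumPy on randomly generated data; the matrix `Z`, the targets `y`, and the helper name `w_reg_norm_sq` are hypothetical stand-ins, and a numerical check of this kind illustrates the claim but does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((30, 5))        # hypothetical transformed data matrix
y = rng.standard_normal(30)             # hypothetical targets

# Eigen-decomposition of Z^T Z; Z^T Z + lam*I shares the eigenvectors,
# with every eigenvalue shifted up by lam.
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
u_coords = eigvecs.T @ (Z.T @ y)        # coordinates of u = Z^T y in the eigenbasis


def w_reg_norm_sq(lam):
    # w_reg^T w_reg = u^T (Z^T Z + lam*I)^{-2} u = sum_i u_i^2 / (eig_i + lam)^2
    return np.sum(u_coords ** 2 / (eigvals + lam) ** 2)


lams = [0.0, 0.1, 1.0, 10.0, 100.0]
norms_sq = [w_reg_norm_sq(lam) for lam in lams]

# Cross-check one value against a direct solve of (Z^T Z + lam*I) w = Z^T y.
w_direct = np.linalg.solve(Z.T @ Z + 1.0 * np.eye(5), Z.T @ y)
```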


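Problem 4.7 Show that the in-sample error
\[
E_{\text{in}}(\mathbf{w}_{\text{reg}}) = \frac{1}{N}\, \mathbf{y}^T (I - H(\lambda))^2 \mathbf{y}
\]
from Example 4.2 is an increasing function of $\lambda$, where
$H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T$ and $Z$ is the transformed data matrix.


To do so, let the SVD of $Z = U\Gamma V^T$ and let $Z^T Z$ have eigenvalues
$\sigma_1^2, \ldots, \sigma_d^2$. Define the vector $\mathbf{a} = U^T\mathbf{y}$. Show that
\[
E_{\text{in}}(\mathbf{w}_{\text{reg}}) = E_{\text{in}}(\mathbf{w}_{\text{lin}})
+ \frac{1}{N} \sum_{i=1}^{d} a_i^2 \left( 1 - \frac{\sigma_i^2}{\sigma_i^2 + \lambda} \right)^2,
\]
and proceed from there.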
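
As a sanity check on the decomposition, the following NumPy sketch compares the direct formula $\frac{1}{N}\mathbf{y}^T(I - H(\lambda))^2\mathbf{y}$ with the SVD form on random data. The data, the sizes, and the helper names `e_in_direct` and `e_in_svd` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 25, 4
Z = rng.standard_normal((N, d))          # hypothetical data matrix (N x d)
y = rng.standard_normal(N)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U diag(s) V^T
a = U.T @ y


def e_in_direct(lam):
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)
    r = (np.eye(N) - H) @ y
    return r @ r / N                      # (1/N) y^T (I - H)^2 y, since H is symmetric


def e_in_svd(lam):
    shrink = 1.0 - s ** 2 / (s ** 2 + lam)
    # E_in(w_lin) is the lam = 0 value of the direct formula.
    return e_in_direct(0.0) + np.sum(a ** 2 * shrink ** 2) / N


lams = [0.0, 0.5, 2.0, 8.0]
direct = [e_in_direct(lam) for lam in lams]
via_svd = [e_in_svd(lam) for lam in lams]
```

Each term $1 - \sigma_i^2/(\sigma_i^2 + \lambda) = \lambda/(\sigma_i^2 + \lambda)$ grows with $\lambda$, which is where the monotonicity comes from.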


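Problem 4.8 In the augmented error minimization with $\Gamma = I$ and $\lambda \ge 0$,
assume that $E_{\text{in}}$ is differentiable and use gradient descent to minimize
$E_{\text{aug}}$:
\[
\mathbf{w}(t+1) \leftarrow \mathbf{w}(t) - \eta \nabla E_{\text{aug}}(\mathbf{w}(t)).
\]
Show that the update rule above is the same as
\[
\mathbf{w}(t+1) \leftarrow (1 - 2\eta\lambda)\,\mathbf{w}(t) - \eta \nabla E_{\text{in}}(\mathbf{w}(t)).
\]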
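
The two update rules can be run side by side. This is a small sketch in plain Python, assuming the augmented error $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \lambda\mathbf{w}^T\mathbf{w}$ and a made-up quadratic `E_in` chosen only so that the gradient is easy to write down; the function names are hypothetical.

```python
def grad_e_in(w):
    """Gradient of a hypothetical in-sample error E_in(w) = (w0 - 1)^2 + 2*(w1 + 3)^2."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 3.0)]


def step_augmented(w, eta, lam):
    # Gradient of E_aug(w) = E_in(w) + lam * w^T w is grad E_in(w) + 2*lam*w.
    g = grad_e_in(w)
    return [wi - eta * (gi + 2.0 * lam * wi) for wi, gi in zip(w, g)]


def step_weight_decay(w, eta, lam):
    # Shrink the weights by (1 - 2*eta*lam), then take the usual E_in gradient step.
    g = grad_e_in(w)
    return [(1.0 - 2.0 * eta * lam) * wi - eta * gi for wi, gi in zip(w, g)]


w_a = w_b = [0.5, -0.25]
for _ in range(50):
    w_a = step_augmented(w_a, eta=0.1, lam=0.3)
    w_b = step_weight_decay(w_b, eta=0.1, lam=0.3)

# Largest coordinate-wise difference between the two trajectories' end points.
max_gap = max(abs(p - q) for p, q in zip(w_a, w_b))
```

The two trajectories agree up to floating-point rounding, as the algebra predicts.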


Note: This is the origin of the name ‘weight decay’: $\mathbf{w}(t)$ decays before being
updated by the gradient of $E_{\text{in}}$.


Problem 4.9 In Tikhonov regularization, the regularized weights are given
by $\mathbf{w}_{\text{reg}} = (Z^T Z + \lambda\Gamma^T\Gamma)^{-1} Z^T \mathbf{y}$. The
Tikhonov regularizer $\Gamma$ is a $k \times (d+1)$
matrix, each row corresponding to a $d+1$ dimensional vector. Each row of $Z$
corresponds to a $d+1$ dimensional vector (the first component is 1). For each
row of $\Gamma$, construct a virtual example $(\mathbf{z}_i, 0)$ for $i = 1, \ldots, k$,
where $\mathbf{z}_i$ is the vector obtained from the $i$th row of $\Gamma$ after scaling
it by $\sqrt{\lambda}$, and the target
value is 0. Add these $k$ virtual examples to the data, to construct an augmented
data set, and consider non-regularized regression with this augmented data.


(a) Show that, for the augmented data,
\[
Z_{\text{aug}} = \begin{bmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{bmatrix}
\quad \text{and} \quad
\mathbf{y}_{\text{aug}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix}.
\]


(b) Show that solving the least squares problem with $Z_{\text{aug}}$ and
$\mathbf{y}_{\text{aug}}$ results in the same regularized weight $\mathbf{w}_{\text{reg}}$,
i.e. $\mathbf{w}_{\text{reg}} = (Z_{\text{aug}}^T Z_{\text{aug}})^{-1} Z_{\text{aug}}^T \mathbf{y}_{\text{aug}}$.


This result may be interpreted as follows: an equivalent way to accomplish 
weight-decay-type regularization with linear models is to create a bunch of 
virtual examples all of whose target values are zero. 
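
The virtual-example construction is easy to try out. The sketch below, using NumPy, assumes random data and a random regularizer matrix (named `Gamma` here), and compares the regularized solution with ordinary least squares on the augmented data set.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d1, k = 20, 4, 4                       # d1 plays the role of d + 1
Z = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d1 - 1))])
y = rng.standard_normal(N)
Gamma = rng.standard_normal((k, d1))      # hypothetical Tikhonov regularizer
lam = 0.7

# Regularized solution: (Z^T Z + lam * Gamma^T Gamma)^{-1} Z^T y
w_reg = np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

# Augmented data: append k virtual examples (sqrt(lam) * row of Gamma, target 0).
Z_aug = np.vstack([Z, np.sqrt(lam) * Gamma])
y_aug = np.concatenate([y, np.zeros(k)])

# Ordinary (unregularized) least squares on the augmented data.
w_aug, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)
```

A practical consequence is that any plain least-squares routine can perform this kind of regularization if it is simply handed the extra rows.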


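Problem 4.10 In this problem, you will investigate the relationship
between the soft order constraint and the augmented error. The regularized
weight $\mathbf{w}_{\text{reg}}$ is a solution to
\[
\min E_{\text{in}}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^T\Gamma^T\Gamma\mathbf{w} \le C.
\]

(a) If $\mathbf{w}_{\text{lin}}^T\Gamma^T\Gamma\mathbf{w}_{\text{lin}} \le C$, then what is
$\mathbf{w}_{\text{reg}}$?

(b) If $\mathbf{w}_{\text{lin}}^T\Gamma^T\Gamma\mathbf{w}_{\text{lin}} > C$, the situation is
illustrated below,





The constraint is satisfied in the shaded region and the contours of constant
$E_{\text{in}}$ are the ellipsoids (why ellipsoids?). What is
$\mathbf{w}_{\text{reg}}^T\Gamma^T\Gamma\mathbf{w}_{\text{reg}}$?

(c) Show that with
\[
\lambda_C = -\frac{1}{2C}\, \mathbf{w}_{\text{reg}}^T \nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}),
\]
$\mathbf{w}_{\text{reg}}$ minimizes
$E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^T\Gamma^T\Gamma\mathbf{w}$. [Hint: use the
previous part to solve for $\mathbf{w}_{\text{reg}}$ as an equality constrained
optimization problem using the method of Lagrange multipliers.]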


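(d) Show that the following hold for $\lambda_C$:
  (i) If $\mathbf{w}_{\text{lin}}^T\Gamma^T\Gamma\mathbf{w}_{\text{lin}} \le C$ then $\lambda_C = 0$ ($\mathbf{w}_{\text{lin}}$ itself satisfies the constraint).
  (ii) If $\mathbf{w}_{\text{lin}}^T\Gamma^T\Gamma\mathbf{w}_{\text{lin}} > C$, then $\lambda_C > 0$ (the penalty term is positive).
  (iii) If $\mathbf{w}_{\text{lin}}^T\Gamma^T\Gamma\mathbf{w}_{\text{lin}} > C$, then $\lambda_C$ is a strictly decreasing function of $C$.


[Hint: show that $\frac{d\lambda_C}{dC} < 0$ for $C \in [0, \mathbf{w}_{\text{lin}}^T\Gamma^T\Gamma\mathbf{w}_{\text{lin}}]$.]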


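Problem 4.11 For the linear model in Exercise 4.2, the target function is
a polynomial of degree $Q_f$; the model is $\mathcal{H}_Q$, with polynomials up to
order $Q$. Assume $Q \ge Q_f$. $\mathbf{w}_{\text{lin}} = (Z^T Z)^{-1} Z^T\mathbf{y}$, and
$\mathbf{y} = Z\mathbf{w}_f + \boldsymbol{\epsilon}$, where $\mathbf{w}_f$ is the
target function and $Z$ is the matrix containing the transformed data.


(a) Show that $\mathbf{w}_{\text{lin}} = \mathbf{w}_f + (Z^T Z)^{-1} Z^T\boldsymbol{\epsilon}$.
What is the average function $\bar{g}$?
Show that bias $= 0$ (recall that: $\text{bias}(x) = (\bar{g}(x) - f(x))^2$).

(b) Show that
\[
\text{var} = \frac{\sigma^2}{N}\, \text{trace}\left( \Sigma_\Phi\, \mathbb{E}_Z\left[ \left( \tfrac{1}{N} Z^T Z \right)^{-1} \right] \right),
\]
where $\Sigma_\Phi = \mathbb{E}[\Phi(x)\Phi^T(x)]$. [Hints:
$\text{var} = \mathbb{E}[(g^{(\mathcal{D})} - \bar{g})^2]$; first take the
expectation with respect to $\boldsymbol{\epsilon}$, then with respect to $\Phi(x)$, the
test point, and the last remaining expectation will be with respect to $Z$. You will
need the cyclic property of the trace.]

(c) Argue that to first order in $\frac{1}{N}$, $\text{var} \approx \frac{\sigma^2 (Q+1)}{N}$.
[Hint: $\frac{1}{N} Z^T Z = \frac{1}{N}\sum_{n=1}^{N} \Phi(x_n)\Phi^T(x_n)$ is the in-sample
estimate of $\Sigma_\Phi$. By the law of large numbers,
$\frac{1}{N} Z^T Z = \Sigma_\Phi + o(1)$.]


For the well specified linear model, the bias is zero and the variance is increasing
as the model gets larger ($Q$ increases), but decreasing in $N$.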


Problem 4.12 Use the setup in Problem 4.11 with $Q \ge Q_f$. Consider
regression with weight decay using a linear model $\mathcal{H}$ in the transformed
space with input probability distribution such that $\mathbb{E}[\mathbf{z}\mathbf{z}^T] = I$.
The regularized weights are given by
$\mathbf{w}_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T\mathbf{y}$, where
$\mathbf{y} = Z\mathbf{w}_f + \boldsymbol{\epsilon}$.

(a) Show that $\mathbf{w}_{\text{reg}} = \mathbf{w}_f - \lambda (Z^T Z + \lambda I)^{-1}\mathbf{w}_f + (Z^T Z + \lambda I)^{-1} Z^T\boldsymbol{\epsilon}$.

(b) Argue that, to first order in $\frac{1}{N}$,
\[
\text{bias} \approx \frac{\lambda^2}{(\lambda + N)^2}\, \|\mathbf{w}_f\|^2,
\qquad
\text{var} \approx \frac{\sigma^2}{N}\, \mathbb{E}[\text{trace}(H^2(\lambda))],
\]
where $H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T$.

If we plot the bias and var, we get a figure that is very similar to Figure 2.3,
where the tradeoff was based on fit and complexity rather than bias and var.
Here, the bias is increasing in $\lambda$ (as expected) and in $\|\mathbf{w}_f\|$;
the variance is decreasing in $\lambda$. When $\lambda = 0$,
$\text{trace}(H^2(\lambda)) = Q + 1$ and so $\text{trace}(H^2(\lambda))$ appears to be
playing the role of an effective number of parameters.

[Figure: bias and var curves; vertical axis Error, horizontal axis Regularization Parameter, $\lambda$]

Problem 4.13 Within the linear regression setting, many attempts have
been made to quantify the effective number of parameters in a model. Three
possibilities are:


(i) $d_{\text{eff}}(\lambda) = 2\,\text{trace}(H(\lambda)) - \text{trace}(H^2(\lambda))$
(ii) $d_{\text{eff}}(\lambda) = \text{trace}(H(\lambda))$
(iii) $d_{\text{eff}}(\lambda) = \text{trace}(H^2(\lambda))$


where $H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T$ and $Z$ is the transformed data matrix. To
obtain $d_{\text{eff}}$, one must first compute $H(\lambda)$ as though you are doing regression.
One can then heuristically use $d_{\text{eff}}$ in place of $d_{\text{vc}}$ in the VC-bound.


(a) When $\lambda = 0$, show that for all three choices, $d_{\text{eff}} = d + 1$, where $d$ is the
dimension in the $\mathcal{Z}$ space.


(b) When $\lambda > 0$, show that $0 \le d_{\text{eff}} \le d + 1$ and $d_{\text{eff}}$ is decreasing
in $\lambda$ for all three choices. [Hint: Use the singular value decomposition.]


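Problem 4.14 The observed target values $\mathbf{y}$ can be separated into the
true target values $\mathbf{f}$ and the noise $\boldsymbol{\epsilon}$,
$\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}$. The components of $\boldsymbol{\epsilon}$ are iid
with variance $\sigma^2$ and expectation 0. For linear regression with weight decay
regularization, by taking the expected value of the in-sample error in (4.2),
show that
\[
\begin{aligned}
\mathbb{E}_{\epsilon}[E_{\text{in}}]
&= \frac{1}{N}\, \mathbf{f}^T (I - H(\lambda))^2 \mathbf{f} + \frac{\sigma^2}{N}\, \text{trace}\left( (I - H(\lambda))^2 \right), \\
&= \frac{1}{N}\, \mathbf{f}^T (I - H(\lambda))^2 \mathbf{f} + \sigma^2 \left( 1 - \frac{d_{\text{eff}}}{N} \right),
\end{aligned}
\]
where $d_{\text{eff}} = 2\,\text{trace}(H(\lambda)) - \text{trace}(H^2(\lambda))$, as defined in
Problem 4.13(i), $H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T$ and $Z$ is the transformed data matrix.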


(a) If the noise was not overfit, what should the term involving $\sigma^2$ be, and
why?


(b) Hence, argue that the degree to which the noise has been overfit is
$\sigma^2 d_{\text{eff}}/N$. Interpret the dependence of this result on the parameters
$d_{\text{eff}}$ and $N$, to justify the use of $d_{\text{eff}}$ as an effective number of
parameters.


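Problem 4.15 We further investigate $d_{\text{eff}}$ of Problems 4.13 and 4.14. We
know that $H(\lambda) = Z(Z^T Z + \lambda\Gamma^T\Gamma)^{-1} Z^T$. When $\Gamma$ is square and
invertible, as is usually the case (for example with weight decay, $\Gamma = I$), denote
$\tilde{Z} = Z\Gamma^{-1}$. Let $s_0^2, \ldots, s_d^2$ be the eigenvalues of
$\tilde{Z}^T\tilde{Z}$ ($s_i^2 > 0$ when $Z$ has full column rank).


(a) For $d_{\text{eff}}(\lambda) = \text{trace}(2H(\lambda) - H^2(\lambda))$, show that
\[
d_{\text{eff}}(\lambda) = d + 1 - \sum_{i=0}^{d} \frac{\lambda^2}{(s_i^2 + \lambda)^2}.
\]

(b) For $d_{\text{eff}}(\lambda) = \text{trace}(H(\lambda))$, show that
\[
d_{\text{eff}}(\lambda) = d + 1 - \sum_{i=0}^{d} \frac{\lambda}{s_i^2 + \lambda}.
\]

(c) For $d_{\text{eff}}(\lambda) = \text{trace}(H^2(\lambda))$, show that
\[
d_{\text{eff}}(\lambda) = \sum_{i=0}^{d} \frac{s_i^4}{(s_i^2 + \lambda)^2}.
\]

In all cases, for $\lambda \ge 0$, $0 \le d_{\text{eff}}(\lambda) \le d + 1$,
$d_{\text{eff}}(0) = d + 1$ and $d_{\text{eff}}$ is decreasing
in $\lambda$. [Hint: use the singular value decomposition $\tilde{Z} = USV^T$, where $U, V$ are
orthogonal and $S$ is diagonal with entries $s_i$.]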
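
The three definitions can be compared numerically from the singular values. The NumPy sketch below assumes weight decay ($\Gamma = I$) and a random data matrix; the helper names `d_eff_all` and `d_eff_from_traces` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d1 = 40, 6                              # d1 plays the role of d + 1
Z = rng.standard_normal((N, d1))           # hypothetical data matrix, Gamma = I
s2 = np.linalg.svd(Z, compute_uv=False) ** 2   # eigenvalues of Z^T Z


def d_eff_all(lam):
    """Return the three effective-parameter counts (i), (ii), (iii) from the singular values."""
    h = s2 / (s2 + lam)                    # nonzero eigenvalues of H(lam)
    return np.sum(2 * h - h ** 2), np.sum(h), np.sum(h ** 2)


def d_eff_from_traces(lam):
    """Same three quantities, computed from the hat matrix itself."""
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d1), Z.T)
    tH, tH2 = np.trace(H), np.trace(H @ H)
    return 2 * tH - tH2, tH, tH2


lams = [0.0, 1.0, 10.0, 100.0]
table = np.array([d_eff_all(lam) for lam in lams])   # rows: lambda, columns: (i), (ii), (iii)
```

Working from the singular values avoids forming the $N \times N$ matrix $H(\lambda)$ for every value of $\lambda$.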


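Problem 4.16 For linear models and the general Tikhonov regularizer $\Gamma$
with penalty term $\lambda\mathbf{w}^T\Gamma^T\Gamma\mathbf{w}$ in the augmented error, show that
\[
\mathbf{w}_{\text{reg}} = (Z^T Z + \lambda\Gamma^T\Gamma)^{-1} Z^T\mathbf{y},
\]
where $Z$ is the feature matrix.
(a) Show that the in-sample predictions are
\[
\hat{\mathbf{y}} = H(\lambda)\mathbf{y},
\]
where $H(\lambda) = Z(Z^T Z + \lambda\Gamma^T\Gamma)^{-1} Z^T$.
(b) Simplify this in the case $\Gamma = Z$ and obtain $\mathbf{w}_{\text{reg}}$ in terms of
$\mathbf{w}_{\text{lin}}$. This is called uniform weight decay.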


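Problem 4.17 To model uncertainty in the measurement of the inputs,
assume that the true inputs $\hat{\mathbf{x}}_n$ are the observed inputs $\mathbf{x}_n$
perturbed by some noise $\boldsymbol{\epsilon}_n$: the true inputs are given by
$\hat{\mathbf{x}}_n = \mathbf{x}_n + \boldsymbol{\epsilon}_n$. Assume that the
$\boldsymbol{\epsilon}_n$ are independent of $(\mathbf{x}_n, y_n)$ with covariance matrix
$\mathbb{E}[\boldsymbol{\epsilon}_n\boldsymbol{\epsilon}_n^T] = \sigma_x^2 I$ and mean
$\mathbb{E}[\boldsymbol{\epsilon}_n] = \mathbf{0}$. The learning algorithm minimizes the
expected in-sample error $\hat{E}_{\text{in}}$,
where the expectation is with respect to the uncertainty in the true $\hat{\mathbf{x}}_n$.
\[
\hat{E}_{\text{in}}(\mathbf{w}) = \mathbb{E}_{\epsilon_1, \ldots, \epsilon_N}\left[ \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^T\hat{\mathbf{x}}_n - y_n)^2 \right].
\]
Show that the weights $\hat{\mathbf{w}}_{\text{lin}}$ which result from minimizing
$\hat{E}_{\text{in}}$ are equivalent to the weights which would have been obtained by minimizing
$E_{\text{in}} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{w}^T\mathbf{x}_n - y_n)^2$ for the observed
data, with Tikhonov regularization.
What are $\Gamma$ and $\lambda$ (see Problem 4.16 for the general Tikhonov regularizer)?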


One can interpret this result as follows: regularization enforces a robustness to 
potential measurement errors (noise) in the observed inputs. 


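Problem 4.18 In a regression setting, assume the target function is
linear, so $f(\mathbf{x}) = \mathbf{w}_f^T\mathbf{x}$, and
$\mathbf{y} = Z\mathbf{w}_f + \boldsymbol{\epsilon}$, where the entries in
$\boldsymbol{\epsilon}$ are iid with zero mean and variance $\sigma^2$. Assume a regularization
term $\lambda\mathbf{w}^T Z^T Z\mathbf{w}$ and that
$\mathbb{E}[\mathbf{x}\mathbf{x}^T] = I$. In this problem derive the optimal value for
$\lambda$ as follows.

(a) Show that the average function is $\bar{g}(\mathbf{x}) = \frac{1}{1+\lambda} f(\mathbf{x})$.
What is the bias?


(b) Show that var is asymptotically $\frac{\sigma^2 (d+1)}{N (1+\lambda)^2}$. [Hint: Problem 4.12.]


(c) Use the bias and asymptotic variance to obtain an expression for
$\mathbb{E}[E_{\text{out}}]$. Optimize this with respect to $\lambda$ to obtain the optimal
regularization parameter. [Answer: $\lambda^* = \frac{\sigma^2 (d+1)}{N \|\mathbf{w}_f\|^2}$.]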


(d) Explain the dependence of the optimal regularization parameter on the 


parameters of the learning problem. [Hint: write \* = DAUG J 


Problem 4.19 [The Lasso algorithm] Rather than a soft order constraint
on the squares of the weights, one could use the absolute values of the weights:
\[
\min E_{\text{in}}(\mathbf{w}) \quad \text{subject to} \quad \sum_{i=0}^{d} |w_i| \le C.
\]
The model is called the lasso algorithm.


(a) Formulate and implement this as a quadratic program. Use the exper- 
imental design in Problem 4.4 to compare the lasso algorithm with the 
quadratic penalty by giving plots of Eout versus regularization parameter. 


(b) What is the augmented error? Is it more convenient to optimize? 


(c) With $d = 5$ and $N = 3$, compare the weights from the lasso versus the
quadratic penalty. [Hint: Look at the number of non-zero weights.]


Problem 4.20 In this problem, you will explore a consistency condition
for weight decay. Suppose that we make an invertible linear transform of the
data,
\[
\mathbf{z}_n = A\mathbf{x}_n, \qquad \tilde{y}_n = \alpha y_n.
\]
Intuitively, linear regression should not be affected by a linear transform. This
means that the new optimal weights should be given by a corresponding linear
transform of the old optimal weights.


(a) Suppose $\mathbf{w}$ minimizes the in-sample error for the original problem. Show
that for the transformed problem, the optimal weights are
\[
\tilde{\mathbf{w}} = \alpha (A^T)^{-1}\mathbf{w}.
\]

(b) Suppose the regularization penalty term in the augmented error is
$\mathbf{w}^T X^T X\mathbf{w}$ for the original data and $\mathbf{w}^T Z^T Z\mathbf{w}$ for the
transformed data.
On the original data, the regularized solution is $\mathbf{w}_{\text{reg}}(\lambda)$. Show that
for the transformed problem, the same linear transform of $\mathbf{w}_{\text{reg}}(\lambda)$ gives
the corresponding regularized weights for the transformed problem:
\[
\tilde{\mathbf{w}}_{\text{reg}}(\lambda) = \alpha (A^T)^{-1}\mathbf{w}_{\text{reg}}(\lambda).
\]


Problem 4.21 The Tikhonov smoothness penalty which penalizes
derivatives of $h$ is $\Omega(h) = \int dx \left( \frac{\partial^2 h}{\partial x^2} \right)^2$.
Show that, for linear models, this reduces to a penalty of the form
$\mathbf{w}^T\Gamma^T\Gamma\mathbf{w}$. What is $\Gamma$?


Problem 4.22 You have a data set with 100 data points. You have 
100 models each with VC-dimension 10. You set aside 25 points for validation. 
You select the model which produced minimum validation error of 0.25. Give 
a bound on the out-of-sample error for this selected function. 


Suppose you instead trained each model on all the data and selected the func- 
tion with minimum in-sample error. The resulting in-sample error is 0.15. Give 
a bound on the out-of-sample error in this case. [Hint: Use the bound in 
Problem 2.14 to bound the VC-dimension of the union of all the models.] 


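Problem 4.23 This problem investigates the covariance of the leave-one-out
cross validation errors, $\text{Cov}_{\mathcal{D}}[e_n, e_m]$. Assume that for well behaved
models, the learning process is ‘stable’, and so the change in the learned hypothesis
should be small, ‘$O(\frac{1}{N})$’, if a new data point is added to a data set of size
$N$. Write $g_n^- = g^{(N-2)} + \delta_n$ and $g_m^- = g^{(N-2)} + \delta_m$, where
$g^{(N-2)}$ is the learned hypothesis on $\mathcal{D}^{(N-2)}$, the data minus the $n$th and
$m$th data points, and $\delta_n, \delta_m$ are the corrections after addition of the $n$th
and $m$th data points respectively.

(a) Show that
\[
\text{Var}_{\mathcal{D}}[E_{\text{cv}}] = \frac{1}{N^2} \sum_{n=1}^{N} \text{Var}_{\mathcal{D}}[e_n]
+ \frac{1}{N^2} \sum_{n \neq m} \text{Cov}_{\mathcal{D}}[e_n, e_m].
\]

(b) Show $\text{Cov}_{\mathcal{D}}[e_n, e_m] = \text{Var}_{\mathcal{D}^{(N-2)}}[E_{\text{out}}(g^{(N-2)})]$ + higher order in $\delta_n, \delta_m$.

(c) Assume that any terms involving $\delta_n, \delta_m$ are $O(\frac{1}{N})$. Argue that
\[
\text{Var}_{\mathcal{D}}[E_{\text{cv}}] = \frac{1}{N}\, \text{Var}_{\mathcal{D}}[e_1] + \text{Var}_{\mathcal{D}}[E_{\text{out}}(g)] + O\left(\frac{1}{N}\right).
\]
Does $\text{Var}_{\mathcal{D}}[e_1]$ decay to zero with $N$? What about
$\text{Var}_{\mathcal{D}}[E_{\text{out}}(g)]$?


(d) Use the experimental design in Problem 4.4 to study $\text{Var}_{\mathcal{D}}[E_{\text{cv}}]$
and give a log-log plot of $\text{Var}_{\mathcal{D}}[E_{\text{cv}}]/\text{Var}_{\mathcal{D}}[e_1]$
versus $N$. What is the decay rate?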


Problem 4.24 For $d = 3$, generate a random data set with $N$ points as
follows. For each point, each dimension of $\mathbf{x}$ has a standard Normal distribution.
Similarly, generate a $(d+1)$-dimensional target weight vector $\mathbf{w}_f$, and set
$y_n = \mathbf{w}_f^T\mathbf{x}_n + \sigma\epsilon_n$ where $\epsilon_n$ is noise (also from a
standard Normal distribution) and $\sigma^2$ is the noise variance; set $\sigma$ to 0.5.


Use linear regression with weight decay regularization to estimate $\mathbf{w}_f$ with
$\mathbf{w}_{\text{reg}}$. Set the regularization parameter to $0.05/N$.


(a) For $N \in \{d+15, d+25, \ldots, d+115\}$, compute the cross validation errors
$e_1, \ldots, e_N$ and $E_{\text{cv}}$. Repeat the experiment (say) $10^5$ times, maintaining
the average and variance over the experiments of $e_1$, $e_2$ and $E_{\text{cv}}$.


(b) How should your average of the $e_1$'s relate to the average of the $E_{\text{cv}}$'s;
how about to the average of the $e_2$'s? Support your claim using results
from your experiment.


(c) What are the contributors to the variance of the $e_1$'s?


(d) If the cross validation errors were truly independent, how should the
variance of the $e_1$'s relate to the variance of the $E_{\text{cv}}$'s?


(e) One measure of the effective number of fresh examples used in computing
$E_{\text{cv}}$ is the ratio of the variance of the $e_1$'s to that of the $E_{\text{cv}}$'s.
Explain why, and plot, versus $N$, the effective number of fresh examples
($N_{\text{eff}}$) as a percentage of $N$. You should find that $N_{\text{eff}}$ is close to $N$.


(f) If you increase the amount of regularization, will $N_{\text{eff}}$ go up or down?
Explain your reasoning. Run the same experiment with $\lambda = 2.5/N$ and
compare your results from part (e) to verify your conjecture.


Problem 4.25 When using a validation set for model selection, all models
were learned on the same $\mathcal{D}_{\text{train}}$ of size $N - K$, and validated on the same
$\mathcal{D}_{\text{val}}$ of size $K$. We have the VC-bound (see Equation (4.12)):
\[
E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\left( \sqrt{\frac{\ln M}{2K}} \right).
\]
Suppose that instead, you had no control over the validation process. So $M$
learners, each with their own models present you with the results of their
validation processes on different validation sets. Here is what you know about
each learner:


Each learner $m$ reports to you the size of their validation set $K_m$,
and the validation error $E_{\text{val}}(m)$. The learners may have used
different data sets, except that they faithfully learned on a training set
and validated on a held out validation set which was only used for
validation purposes.


As the model selector, you have to decide which learner to go with. 


(a) Should you select the learner with minimum validation error? If yes, why? 
If no, why not? [Hint: think VC-bound.] 


(b) If all models are validated on the same validation set as described in the 
text, why is it okay to select the learner with the lowest validation error? 


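(c) After selecting learner $m^*$ (say), show that
\[
\mathbb{P}[E_{\text{out}}(m^*) > E_{\text{val}}(m^*) + \epsilon] \le M e^{-2\epsilon^2 \kappa(\epsilon)},
\]
where $\kappa(\epsilon) = -\frac{1}{2\epsilon^2} \ln\left( \frac{1}{M} \sum_{m=1}^{M} e^{-2\epsilon^2 K_m} \right)$
is an “average” validation set size.


(d) Show that with probability at least $1 - \delta$,
$E_{\text{out}} \le E_{\text{val}} + \epsilon^*$, for any $\epsilon^*$
which satisfies $\epsilon^* \ge \sqrt{\frac{\ln(M/\delta)}{2\kappa(\epsilon^*)}}$.


(e) Show that $\min_m K_m \le \kappa(\epsilon) \le \frac{1}{M}\sum_{m=1}^{M} K_m$. Is this
bound better or worse than the bound when all models use the same validation set size
(equal to the average validation set size $\frac{1}{M}\sum_{m=1}^{M} K_m$)?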


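Problem 4.26 In this problem, derive the formula for the exact expression
for the leave-one-out cross validation error for linear regression. Let $Z$ be the
data matrix whose rows correspond to the transformed data points
$\mathbf{z}_n = \Phi(\mathbf{x}_n)$.


(a) Show that:
\[
Z^T Z = \sum_{n=1}^{N} \mathbf{z}_n\mathbf{z}_n^T; \qquad
Z^T\mathbf{y} = \sum_{n=1}^{N} \mathbf{z}_n y_n; \qquad
H_{nm}(\lambda) = \mathbf{z}_n^T A^{-1}(\lambda)\mathbf{z}_m,
\]
where $A = A(\lambda) = Z^T Z + \lambda\Gamma^T\Gamma$ and $H(\lambda) = ZA(\lambda)^{-1}Z^T$.
Hence, show that when $(\mathbf{z}_n, y_n)$ is left out,
$Z^T Z \rightarrow Z^T Z - \mathbf{z}_n\mathbf{z}_n^T$, and
$Z^T\mathbf{y} \rightarrow Z^T\mathbf{y} - \mathbf{z}_n y_n$.

(b) Compute $\mathbf{w}_n^-$, the weight vector learned when the $n$th data point is left
out, and show that:
\[
\mathbf{w}_n^- = \left( A^{-1} + \frac{A^{-1}\mathbf{z}_n\mathbf{z}_n^T A^{-1}}{1 - \mathbf{z}_n^T A^{-1}\mathbf{z}_n} \right) (Z^T\mathbf{y} - \mathbf{z}_n y_n).
\]
[Hint: use the identity
$(A - \mathbf{x}\mathbf{x}^T)^{-1} = A^{-1} + \frac{A^{-1}\mathbf{x}\mathbf{x}^T A^{-1}}{1 - \mathbf{x}^T A^{-1}\mathbf{x}}$.]

(c) Using (a) and (b), show that
$\mathbf{w}_n^- = \mathbf{w} + \frac{\hat{y}_n - y_n}{1 - H_{nn}}\, A^{-1}\mathbf{z}_n$, where
$\mathbf{w}$ is the regression weight vector using all the data.

(d) The prediction on the validation point is given by $\mathbf{z}_n^T\mathbf{w}_n^-$. Show that
\[
\mathbf{z}_n^T\mathbf{w}_n^- = \frac{\hat{y}_n - H_{nn} y_n}{1 - H_{nn}}.
\]

(e) Show that $e_n = \left( \frac{\hat{y}_n - y_n}{1 - H_{nn}} \right)^2$, and hence prove
Equation (4.13).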
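
The closed-form leave-one-out error can be compared against brute-force refitting. The NumPy sketch below assumes weight decay ($\Gamma = I$), random data, and a regularization parameter that is held fixed when a point is removed, which matches the setup in part (a).

```python
import numpy as np

rng = np.random.default_rng(5)
N, d1, lam = 15, 3, 0.2                    # weight decay case, Gamma = I
Z = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d1 - 1))])
y = rng.standard_normal(N)

A = Z.T @ Z + lam * np.eye(d1)
H = Z @ np.linalg.solve(A, Z.T)            # hat matrix H(lam)
y_hat = H @ y

# Analytic leave-one-out errors: e_n = ((y_hat_n - y_n) / (1 - H_nn))^2
e_analytic = ((y_hat - y) / (1.0 - np.diag(H))) ** 2

# Brute force: refit N times, each time leaving one point out.
e_brute = np.empty(N)
for n in range(N):
    keep = np.arange(N) != n
    Zn, yn = Z[keep], y[keep]
    w_minus = np.linalg.solve(Zn.T @ Zn + lam * np.eye(d1), Zn.T @ yn)
    e_brute[n] = (Z[n] @ w_minus - y[n]) ** 2

E_cv = e_analytic.mean()
```

The analytic route needs a single fit on all the data, whereas the brute-force loop solves $N$ separate regression problems.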


Problem 4.27 Cross validation gives an accurate estimate of
$\bar{E}_{\text{out}}(N - 1)$, but it can be quite sensitive, leading to problems in model
selection. A common heuristic for ‘regularizing’ cross validation is to use a measure of
error $\sigma_{\text{cv}}(\mathcal{H})$ for the cross validation estimate in model selection.


(a) One choice for $\sigma_{\text{cv}}$ is the standard deviation of the leave-one-out errors
divided by $\sqrt{N}$,
$\sigma_{\text{cv}} \approx \frac{1}{\sqrt{N}} \sqrt{\text{var}(e_1, \ldots, e_N)}$.
Why divide by $\sqrt{N}$?


(b) For linear models, show that
\[
\sqrt{N}\,\sigma_{\text{cv}} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\hat{y}_n - y_n}{1 - H_{nn}} \right)^4 - E_{\text{cv}}^2 }.
\]


(c) (i) Given the best model $\mathcal{H}^*$, the conservative one-sigma approach
selects the simplest model within $\sigma_{\text{cv}}(\mathcal{H}^*)$ of the best.

(ii) The bound minimizing approach selects the model which minimizes
$E_{\text{cv}}(\mathcal{H}) + \sigma_{\text{cv}}(\mathcal{H})$.

Use the experimental design in Problem 4.4 to compare these approaches
with the ‘unregularized’ cross validation estimate as follows. Fix $Q_f = 15$,
$Q = 20$, and $\sigma = 1$. Use each of the two methods proposed here as well as
traditional cross validation to select the optimal value of the regularization
parameter $\lambda$ in the range $\{0.05, 0.10, 0.15, \ldots, 5\}$ using weight decay
regularization, $\Omega(\mathbf{w}) = \lambda\mathbf{w}^T\mathbf{w}$. Plot the resulting
out-of-sample error for the model selected using each method as a function of $N$, with
$N$ in the range $\{2 \times Q, 3 \times Q, \ldots, 10 \times Q\}$.


What are your conclusions?


Chapter 5


Three Learning Principles 


The study of learning from data highlights some general principles that are 
fascinating concepts in their own right. Having gone through the mathematical 
analysis and empirical illustrations of the first few chapters, we have a good 
foundation from which to articulate some of these principles and explain them 
in concrete terms. 

In this chapter, we will discuss three principles. The first one is related to 
the choice of model and is called Occam’s razor. The other two are related 
to data; sampling bias establishes an important principle about obtaining the 
data, and data snooping establishes an important principle about handling 
the data. A genuine understanding of these principles will protect you from 
the most common pitfalls in learning from data, and allow you to interpret 
generalization performance properly. 


5.1 Occam’s Razor 


Although it is not an exact quote of Einstein’s, it is often attributed to him 
that “An explanation of the data should be made as simple as possible, but no 
simpler.” A similar principle, Occam’s Razor, dates from the 14th century and 
is attributed to William of Occam, where the ‘razor’ is meant to trim down 
the explanation to the bare minimum that is consistent with the data. 

In the context of learning, the penalty for model complexity which was
introduced in Section 2.2 is a manifestation of Occam’s razor. If Ein(g) = 0,
then the explanation (hypothesis) is consistent with the data. In this case,
the most plausible explanation, with the lowest estimate of Eout given in the
VC bound (2.14), happens when the complexity of the explanation (measured
by dvc(H)) is as small as possible. Here is a statement of the underlying
principle.


The simplest model that fits the data is also the most plausible.


Applying this principle, we should choose as simple a model as we think we can
get away with. Although the principle that simpler is better may be intuitive, 
it is neither precise nor self-evident. When we apply the principle to learning 
from data, there are two basic questions to be asked. 


1. What does it mean for a model to be simple? 


2. How do we know that simpler is better? 


Let’s start with the first question. There are two distinct approaches to defin- 
ing the notion of complexity, one based on a family of objects and the other 
based on an individual object. We have already seen both approaches in our 
analysis. The VC dimension in Chapter 2 is a measure of complexity, and it 
is based on the hypothesis set H as a whole, i.e., based on a family of objects. 
The regularization term of the augmented error in Chapter 4 is also a measure 
of complexity, but in this case it is the complexity of an individual object, 
namely the hypothesis h. 

The two approaches to defining complexity are not encountered only in 
learning from data; they are a recurring theme whenever complexity is dis- 
cussed. For instance, in information theory, entropy is a measure of complexity 
based on a family of objects, while minimum description length is a related 
measure based on individual objects. There is a reason why this is a recurring 
theme. The two approaches to defining complexity are in fact related. 

When we say a family of objects is complex, we mean that the family is 
‘big’. That is, it contains a large variety of objects. Therefore, each individual 
object in the family is one of many. By contrast, a simple family of objects is 
‘small’; it has relatively few objects, and each individual object is one of few. 

Why is the sheer number of objects an indication of the level of complexity? 
The reason is that both the number of objects in a family and the complexity 
of an object are related to how many parameters are needed to specify the 
object. When you increase the number of parameters in a learning model, you 
simultaneously increase how diverse H is and how complex the individual h is. 
For example, consider 17th order polynomials versus 3rd order polynomials. 
There is more variety in 17th order polynomials, and at the same time the 
individual 17th order polynomial is more complex than a 3rd order polynomial. 

The most common definitions of object complexity are based on the number 
of bits needed to describe an object. Under such definitions, an object is simple 
if it has a short description. Therefore, a simple object is not only intrinsically 
simple (as it can be described succinctly), but it also has to be one of few, 
since there are fewer objects that have short descriptions than there are that 
have long descriptions, as a matter of simple counting. 


Exercise 5.1 


Consider hypothesis sets $\mathcal{H}_1$ and $\mathcal{H}_{100}$ that contain Boolean
functions on 10 Boolean variables, so $\mathcal{X} = \{-1, +1\}^{10}$. $\mathcal{H}_1$
contains all Boolean functions which evaluate to +1 on exactly one input point, and to −1
elsewhere; $\mathcal{H}_{100}$ contains all Boolean functions which evaluate to +1 on
exactly 100 input points, and to −1 elsewhere.


(a) How big (number of hypotheses) are $\mathcal{H}_1$ and $\mathcal{H}_{100}$?
(b) How many bits are needed to specify one of the hypotheses in $\mathcal{H}_1$?
(c) How many bits are needed to specify one of the hypotheses in $\mathcal{H}_{100}$?
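
The counting in this exercise can be set up in a few lines of standard-library Python. This is a sketch under one assumption: "bits needed to specify a hypothesis" is read as the number of bits required to index one member of the set, which is only one of several reasonable encodings.

```python
import math

num_inputs = 2 ** 10                       # points in {-1, +1}^10

# A hypothesis is determined by which inputs it maps to +1.
size_h1 = math.comb(num_inputs, 1)
size_h100 = math.comb(num_inputs, 100)

# Bits needed to index one member of a set with `size` elements:
# the bit length of the largest index, size - 1.
bits_h1 = (size_h1 - 1).bit_length()
bits_h100 = (size_h100 - 1).bit_length()
```

Comparing `bits_h1` with `bits_h100` ties the size of the family to the description length of an individual member, the link drawn in the preceding discussion.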


We now address the second question. When Occam’s razor says that simpler 
is better, it doesn’t mean simpler is more elegant. It means simpler has a 
better chance of being right. Occam’s razor is about performance, not about 
aesthetics. If a complex explanation of the data performs better, we will 
take it. 

The argument that simpler has a better chance of being right goes as follows.
We are trying to fit a hypothesis to our data $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$
(assume the $y_n$'s are binary). There are fewer simple hypotheses than there are
complex ones. With complex hypotheses, there would be enough of them to
shatter $\mathbf{x}_1, \ldots, \mathbf{x}_N$, so it is certain that we can fit the data set regardless of
what the labels $y_1, \ldots, y_N$ are, even if these are completely random. Therefore,
fitting the data does not mean much. If, instead, we have a simple model
with few hypotheses and we still found one that perfectly fits the dichotomy
$\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, this is surprising, and therefore it means
something.
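
The argument can be quantified in a simple idealized setting. The sketch below assumes that the labels are independent fair coin flips and that the model can produce `num_dichotomies` distinct labelings of the $N$ points (both the setting and the function name are our assumptions, not part of the text). Each labeling matches the random labels with probability $2^{-N}$, and two distinct labelings cannot both match, so the chance that some hypothesis fits perfectly is the number of dichotomies divided by $2^N$.

```python
def chance_of_perfect_fit(num_dichotomies: int, num_points: int) -> float:
    """Probability that at least one of `num_dichotomies` distinct labelings
    of `num_points` points agrees with uniformly random binary labels."""
    total = 2 ** num_points  # number of possible labelings of the points
    if not 0 <= num_dichotomies <= total:
        raise ValueError("a model cannot produce more than 2**N dichotomies")
    return num_dichotomies / total
```

With $N = 10$, a model that shatters the points (1024 dichotomies) fits random labels with probability 1, so the fit carries no information; a model with only 2 dichotomies fits with probability 2/1024, so a perfect fit would be surprising.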

Occam’s Razor has been formally proved under different sets of idealized 
conditions. The above argument captures the essence of these proofs; if some- 
thing is less likely to happen, then when it does happen it is more significant. 
Let us look at an example. 


Example 5.1. Suppose that one constructs a physical theory about the resistivity
of a metal under various temperatures. In this theory, aside from
some constants that need to be determined, the resistivity $\rho$ has a linear
dependence on the temperature $T$. In order to verify that the theory is correct
and to obtain the unknown constants, 3 scientists conduct the following three
experiments and present their data to you.


[Figure: three plots of resistivity $\rho$ versus temperature $T$, one each for Scientist 1, Scientist 2, and Scientist 3.]


It is clear that Scientist 3 has produced the most convincing evidence for the
theory. If the measurements are exact, then Scientist 2 has managed to falsify
the theory and we are back to the drawing board. What about Scientist 1?
While he has not falsified the theory, has he provided any evidence for it? The
answer is no, for we can reverse the question. Suppose that the theory was not
correct, what could the data have done to prove him wrong? Nothing, since
any two points can be joined by a line. Therefore, the model is not just likely
to fit the data in this case, it is certain to do so. This renders the fit totally
insignificant when it does happen. □
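
The point about Scientist 1 can be checked mechanically. The following is a minimal sketch with a hypothetical helper `fits_a_line` that tests whether a set of exact $(T, \rho)$ measurements lies on one straight line; the data values in the usage note are made up and are not read off the figure.

```python
def fits_a_line(points, tol=1e-9):
    """Return True if all (T, rho) points lie on a single straight line.

    Assumes exact measurements taken at distinct temperatures.
    """
    if len(points) <= 2:
        return True  # any two points can be joined by a line
    (t0, r0), (t1, r1) = points[0], points[1]
    for t, r in points[2:]:
        # The cross product of (p1 - p0) and (p - p0) vanishes exactly
        # when p lies on the line through p0 and p1.
        cross = (t1 - t0) * (r - r0) - (r1 - r0) * (t - t0)
        if abs(cross) > tol:
            return False
    return True
```

With two measurements, `fits_a_line([(1.0, 2.0), (5.0, -3.0)])` is true whatever the values are, so the outcome could never have contradicted the linear theory. With three measurements the check can fail, for example on `[(0, 0), (1, 1), (2, 3)]`, which is what gives a successful fit its meaning.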


This example illustrates a concept related to Occam’s Razor, which is the 
axiom of non-falsifiability. The axiom asserts that the data should have some 
chance of falsifying a hypothesis, if we are to conclude that it can provide 
evidence for the hypothesis. One way to guarantee that every data set has 
some chance at falsification is for the VC dimension of the hypothesis set 
to be less than N, the number of data points. This is discussed further in 
Problem 5.1. Here is another example of the same concept. 


Example 5.2. Financial firms try to pick good traders (predictors of whether 
the market will go up or not). Suppose that each trader is tested on their 
prediction (up or down) over the next 5 days and those who perform well will 
be hired. One might think that this process should produce better and better 
traders on Wall Street. Viewed as a learning problem, consider each trader 
to be a prediction hypothesis. Suppose that the hiring pool is ‘complex’; we 
are interviewing $2^5$ traders who happen to be a diverse set of people such that
their predictions over the next 5 days are all different. Necessarily one of these 
traders gets it all correct, and will be hired. Hiring the trader through this 
process may or may not be a good thing, since the process will pick someone 
even if the traders are just flipping coins to make their predictions. A perfect 
predictor always exists in this group, so finding one doesn’t mean much. If we 
were interviewing only two traders, and one of them made perfect predictions, 
that would mean something. □
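
The claim that one of the $2^5$ traders is necessarily perfect can be verified by brute force. This is a small sketch under the example's own assumption that the traders' 5-day predictions are all different; the names `traders` and `perfect_traders` are ours.

```python
from itertools import product

DAYS = 5
# One hypothetical trader per possible up/down sequence: 2**5 = 32 predictors.
traders = list(product((+1, -1), repeat=DAYS))


def perfect_traders(outcome):
    """Return the traders whose predictions match the market on every day."""
    return [t for t in traders if t == tuple(outcome)]
```

Whatever the market does, `perfect_traders(outcome)` contains exactly one trader, so the existence of a perfect record says nothing about skill.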


Exercise 5.2 


Suppose that for 5 weeks in a row, a letter arrives in the mail that predicts 
the outcome of the upcoming Monday night football game. You keenly 
watch each Monday and to your surprise, the prediction is correct each 
time. On the day after the fifth game, a letter arrives, stating that if you 
wish to see next week's prediction, a payment of $50.00 is required. Should 
you pay? 

(a) How many possible predictions of win-lose are there for 5 games? 


(b) If the sender wants to make sure that at least one person receives
correct predictions on all 5 games from him, how many people should
he target to begin with?

(c) After the first letter ‘predicting’ the outcome of the first game, how
many of the original recipients does he target with the second letter?


(d) How many letters altogether will have been sent at the end of the 5 
weeks? 

(e) If the cost of printing and mailing out each letter is $0.50, how much 
would the sender make if the recipient of 5 correct predictions sent in 
the $50.00? 

(f) Can you relate this situation to the growth function and the credibility
of fitting the data? 


Learning from data takes Occam’s Razor to another level, going beyond “as 
simple as possible, but no simpler.” Indeed, we may opt for ‘a simpler fit 
than possible’, namely an imperfect fit of the data using a simple model over 
a perfect fit using a more complex one. The reason is that the price we pay 
for a perfect fit in terms of the penalty for model complexity in (2.14) may 
be too much in comparison to the benefit of the better fit. This idea was 
illustrated in Figure 3.7, and is a manifestation of overfitting. The idea is also 
the rationale behind the recommended policy in Chapter 3: first try a linear 
model — one of the simplest models in the arena of learning from data. 


5.2 Sampling Bias 


A vivid example of sampling bias happened in the 1948 US presidential election 
between Truman and Dewey. On election night, a major newspaper carried 
out a telephone poll to ask people how they voted. The poll indicated that 
Dewey won, and the paper was so confident about the small error bar in its 
poll that it declared Dewey the winner in its headline. When the actual votes 
were counted, Dewey lost, to the delight of a smiling Truman.

This was not a case of statistical anomaly, where the newspaper was just
incredibly unlucky (remember the $\delta$ in the VC bound?). It was a case where
the sample was doomed from the get-go, regardless of its size. Even if the 
experiment were repeated, the result would be the same. In 1948, telephones 
were expensive and those who had them tended to be in an elite group that 
favored Dewey much more than the average voter did. Since the newspaper did 
its poll by telephone, it inadvertently used an in-sample distribution that was 
different from the out-of-sample distribution. That is what sampling bias is. 


If the data is sampled in a biased way, learning will pro- 
duce a similarly biased outcome. 


Applying this principle, we should make sure that the training and testing 
distributions are the same; if not, our results may be invalid, or, at the very 
least, require careful interpretation.
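
A small simulation shows why a larger sample does not help. The sketch below is ours, and every number in it is made up for illustration (none of it is historical data): it assumes a fixed fraction of voters own telephones and that telephone owners favor Dewey more than other voters do.

```python
import random

# All numbers below are illustrative assumptions, not historical data.
PHONE_FRACTION = 0.30         # assumed fraction of voters who own a telephone
DEWEY_GIVEN_PHONE = 0.65      # assumed Dewey support among telephone owners
DEWEY_GIVEN_NO_PHONE = 0.40   # assumed Dewey support among everyone else

TRUE_DEWEY_SHARE = (PHONE_FRACTION * DEWEY_GIVEN_PHONE
                    + (1 - PHONE_FRACTION) * DEWEY_GIVEN_NO_PHONE)


def poll(num_voters, phone_only, rng):
    """Estimate Dewey's vote share from a random sample of voters."""
    dewey_votes = 0
    polled = 0
    while polled < num_voters:
        has_phone = rng.random() < PHONE_FRACTION
        if phone_only and not has_phone:
            continue  # voters without a telephone can never enter the sample
        p_dewey = DEWEY_GIVEN_PHONE if has_phone else DEWEY_GIVEN_NO_PHONE
        dewey_votes += rng.random() < p_dewey
        polled += 1
    return dewey_votes / num_voters
```

Under these assumed numbers the true Dewey share is 47.5%, yet `poll(n, phone_only=True, rng=random.Random(0))` settles near 65% as `n` grows. Increasing the sample size shrinks the error bar around the wrong answer.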

If you recall, the VC analysis made very few assumptions, but one as- 
sumption it did make was that the data set D is generated from the same 
distribution that the final hypothesis g is tested on. In practice, we may en- 
counter data sets that were not generated under those ideal conditions. There 
are some techniques in statistics and in learning to compensate for the ‘mis- 
match’ between training and testing, but not in cases where D was generated 
with the exclusion of certain parts of the input space, such as the exclusion of 
households with no telephones in the above example. There is nothing that 
can be done when this happens, other than to admit that the result will not 
be reliable — statistical bounds like Hoeffding and VC require a match between 
the training and testing distributions. 

There are many examples of how sampling bias can be introduced in data 
collection. In some cases it is inadvertently introduced by an oversight, as 
in the case of Dewey and Truman. In other cases, it is introduced because 
certain types of data are not available. For instance, in our credit example of 
Chapter 1, the bank created the training set from the database of previous cus- 
tomers and how they performed for the bank. Such a set necessarily excludes 
those who applied to the bank for credit cards and were rejected, because the 
bank does not have data on how they would have performed if they were ac- 
cepted. Since future applicants will come from a mixed population including 
some who would have been rejected in the past, the ‘test set’ comes from a 
different distribution than the training set, and we have a case of sampling 
bias. In this particular case, if no data on the applicants that were rejected is 
available, nothing much can be done other than to acknowledge that there is 
a bias in the final predictor that learning will produce, since a representative 
training set is just not available. 


Exercise 5.3 


In an experiment to determine the distribution of sizes of fish in a lake, a
net might be used to catch a representative sample of fish. The sample is
then analyzed to find out the fractions of fish of different sizes. If the
sample is big enough, statistical conclusions may be drawn about the actual
distribution in the entire lake. Can you smell ☺ sampling bias?


There are other cases, arguably more common, where sampling bias is intro- 
duced by human intervention. It is not that uncommon for someone to throw 
away training examples they don’t like! A Wall Street firm that wants to
develop an automated trading system might choose data sets when the market
was ‘behaving well’ to train the system, with the semi-legitimate justification 
that they don’t want the noise to complicate the training process. They will 
surely achieve that if they get rid of the ‘bad’ examples, but they will create a 
system that can be trusted only in the periods when the market does behave 
well! What happens when the market is not behaving well is anybody’s guess. 
In general, throwing away training examples based on their values, e.g., ex- 
amples that look like outliers or don’t conform to our preconceived ideas, is a 
fairly common sampling bias trap. 


Other biases. Sampling bias has also been called selection bias in the statis- 
tics community. We will stick with the more descriptive term sampling bias 
for two reasons. First, the bias arises in how the data was sampled; second, it 
is less ambiguous because in the learning context, there is another notion of 
selection bias drifting around — selection of a final hypothesis from the learning 
model based on the data. The performance of the selected hypothesis on the 
data is optimistically biased, and this could be denoted as a selection bias. 
We have referred to this type of bias simply as bad generalization. 

There are various other biases that have similar flavor. There is even 
a special type of bias for the research community, called publication bias! 
This refers to the bias in published scientific results because negative results 
are often not published in the literature, whereas positive results are. The 
common theme of all of these biases is that they render the standard statistical 
conclusions invalid because the basic premise for such conclusions, that the 
sampling distribution is the same as the overall distribution, does not hold 
any more. In the field of learning from data, it is sampling bias in the training 
set that we need to worry about. 


5.3 Data Snooping 


Data snooping is the most common trap for practitioners in learning from 
data. The principle involved is simple enough, 


If a data set has affected any step in the learning process,
its ability to assess the outcome has been compromised.


Applying this principle, if you want an unbiased assessment of your learning
performance, you should keep a test set in a vault and never use it for learning 
in any way. This is basically what we have been talking about all along in 
training versus testing, but it goes beyond that. Even if a data set has not been 
‘physically’ used for training, it can still affect the learning process, sometimes 
in subtle ways. 


Exercise 5.4 


Consider the following approach to learning. By looking at the data, it 
appears that the data is linearly separable, so we go ahead and use a simple 
perceptron, and get a training error of zero after determining the optimal 
set of weights. We now wish to make some generalization conclusions, so
we look up the $d_{\mathrm{VC}}$ for our learning model and see that it is $d+1$. Therefore,
we use this value of $d_{\mathrm{VC}}$ to get a bound on the test error.


(a) What is the problem with this bound? Is it correct?


(b) Do we know the $d_{\mathrm{VC}}$ for the learning model that we actually used? It
is this $d_{\mathrm{VC}}$ that we need to use in the bound.


To avoid the pitfall in the above exercise, it is extremely important that you 
choose your learning model before seeing any of the data. The choice can be 
based on general information about the learning problem, such as the num- 
ber of data points and prior knowledge regarding the input space and target 
function, but not on the actual data set D. Failure to observe this rule will 
invalidate the VC bounds, and any generalization conclusions will be up in the 
air. Even a careful person can fall into the traps of data snooping. Consider 
the following example. 


Example 5.3. An investment bank wants to develop a system for forecasting 
currency exchange rates. It has 8 years worth of historical data on the US 
Dollar (USD) versus the British Pound (GBP), so it tries to use the data to see 
if there is any pattern that can be exploited. The bank takes the series of daily 
changes in the USD/GBP rate, normalizes it to zero mean and unit variance, 
and starts to develop a system for forecasting the direction of the change. For 
each day, it tries to predict that direction based on the fluctuations in the 
previous 20 days. 75% of the data is used for training, and the remaining 25% 
is set aside for testing the final hypothesis. 


The test shows great success. The final hypothesis has a hit rate (per- 
centage of time getting the direction right) of 52.1%. This may seem modest, 
but in the world of finance you can make a lot of money if you get that 
hit rate consistently. Indeed, over the 500 test days (2 years worth, as each 
year has about 250 trading days), the cumulative profit of the system is a 
respectable 22%.

When the system is used in live trading, the performance deteriorates
significantly. In fact, it loses money. Why didn’t the good test performance
continue on the new data? In this case, there is a simple explanation and it 
has to do with data snooping. Although the bank was careful to set aside 
test points that were not used for training in order to properly evaluate the 
final hypothesis, the test data had in fact affected the training process in a 
subtle way. When the original series of daily changes was normalized to zero 
mean and unit variance, all of the data was involved in this step. Therefore, 
the test data that was extracted had already contributed to the choices made 
by the learning algorithm by contributing to the values of the mean and the 
variance that were used in normalization. Although this seems like a minor 
effect, it is data snooping. When you plot the cumulative profit on the test 
set with or without that snooping step, you see how snooping resulted in an 
over-optimistic expectation compared to the realistic expectation that avoids 
snooping. 

It is not the normalization that was a bad idea. It is the involvement of 
test data in that normalization, which contaminated this data and rendered 
its estimate of the final performance inaccurate. □
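
One way to normalize without snooping is to compute the mean and the variance from the training portion only and then apply those same two numbers to the test portion. The sketch below assumes a chronological 75%/25% split as in the example; the function name and its parameters are ours, and it is an illustration rather than the bank's actual procedure.

```python
import statistics


def normalize_without_snooping(series, train_fraction=0.75):
    """Split a series chronologically, then normalize both parts using the
    mean and standard deviation of the training part only."""
    split = int(len(series) * train_fraction)
    train, test = series[:split], series[split:]
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)  # population standard deviation of train
    if std == 0:
        raise ValueError("training part has zero variance")

    def scale(values):
        return [(v - mean) / std for v in values]

    # The test values never influence `mean` or `std`.
    return scale(train), scale(test)
```

With this version, changing the test values leaves the normalized training data unchanged. That is the property the snooped normalization in the example lacks.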


One of the most common occurrences of data snooping is the reuse of the 
same data set. If you try learning using first one model and then another and 
then another on the same data set, you will eventually ‘succeed’. As the saying 
goes, if you torture the data long enough, it will confess ☺. If you try all
possible dichotomies, you will eventually fit any data set; this is true whether 
we try the dichotomies directly (using a single model) or indirectly (using a 
sequence of models). The effective VC dimension for the series of trials will 
not be that of the last model that succeeded, but of the entire union of models 
that could have been used depending on the outcomes of different trials. 
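
The effect of repeated trials is easy to simulate. In the sketch below (our own illustration, with made-up sizes), every 'model' is a coin-flipping predictor with no skill at all, and we report the best hit rate found on one fixed data set.

```python
import random


def best_hit_rate(num_models, num_days, rng):
    """Best hit rate among `num_models` coin-flipping predictors evaluated
    on one fixed sequence of `num_days` random market directions."""
    market = [rng.choice((+1, -1)) for _ in range(num_days)]
    best = 0.0
    for _ in range(num_models):
        # Each model guesses every day at random, so its true hit rate is 50%.
        hits = sum(rng.choice((+1, -1)) == m for m in market)
        best = max(best, hits / num_days)
    return best
```

Calling `best_hit_rate(200, 500, random.Random(1))` typically returns a hit rate well above 50%, even though every predictor is worthless out of sample. The more models are tried, the better the best one looks.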

Sometimes the reuse of the same data set is carried out by different people. 
Let’s say that there is a public data set that you would like to work on. Before 
you download the data, you read about how other people did with this data set
using different techniques. You naturally pick the most promising techniques
as a baseline, then try to improve on them and introduce your own ideas. 
Although you haven’t even seen the data set yet, you are already guilty of 
data snooping. Your choice of baseline techniques was affected by the data 
set, through the actions of others. You may find that your estimates of the 
performance will turn out to be too optimistic, since the techniques you are 
using have already proven well-suited to this particular data set. 

To quantify the damage done by data snooping, one has to assess the 
penalty for model complexity in (2.14) taking the snooping into consideration. 
In the public data set case, the effective VC dimension corresponds to a much 
bigger hypothesis set than the H that your learning algorithm uses. It covers 
all hypotheses that were considered (and mostly rejected) by everybody else 
in the process of coming up with the solutions that they published and that 
you used as your baseline. This is a potentially huge set with very high VC 
dimension, hence the generalization guarantees in (2.14) will be much worse 
than without data snooping. 

Not all data sets subjected to data snooping are equally ‘contaminated’. 
The bounds in (1.6) in the case of a choice between a finite number of hy- 
potheses, and in (2.12) in the case of an infinite number, provide guidelines 
for the level of contamination. The more elaborate the choice made based on 
a data set, the more contaminated the set becomes and the less reliable it will 
be in gauging the performance of the final hypothesis. 
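
As a rough guide to how contamination grows with the number of choices, the sketch below evaluates the finite-hypothesis bound, assuming (1.6) is the Hoeffding bound with a union over $M$ hypotheses, $2Me^{-2\epsilon^2 N}$ (the form used in Problem 5.4). The function name and the numbers in the usage note are ours.

```python
import math


def hoeffding_bound(num_hypotheses, num_points, epsilon):
    """Upper bound 2*M*exp(-2*eps^2*N) on P[|E_in - E_out| > eps],
    capped at 1 because it bounds a probability."""
    bound = 2 * num_hypotheses * math.exp(-2 * epsilon ** 2 * num_points)
    return min(1.0, bound)
```

For example, with 1000 points and $\epsilon = 0.05$, the bound is about 0.013 when the data set is used to assess a single fixed hypothesis, and about 0.13 when the same data set is used to choose among 10 hypotheses.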


Exercise 5.5 


Assume we set aside 100 examples from $\mathcal{D}$ that will not be used in training,
but will be used to select one of three final hypotheses $g_1, g_2, g_3$ produced by
three different learning algorithms that train on the rest of the data. Each
algorithm works with a different $\mathcal{H}$ of size 500. We would like to characterize
the accuracy of estimating $E_{\mathrm{out}}(g)$ on the selected final hypothesis if we
use the same 100 examples to make that estimate.


(a) What is the value of $M$ that should be used in (1.6) in this situation?
(b) How does the level of contamination of these 100 examples compare
to the case where they would be used in training rather than in the
final selection?


In order to deal with data snooping, there are basically two approaches. 


1. Avoid data snooping: A strict discipline in handling the data is required. 
Data that is going to be used to evaluate the final performance should 
be ‘locked in a safe’ and only brought out after the final hypothesis has 
been decided. If intermediate tests are needed, separate data sets should 
be used for that. Once a data set has been used, it should be treated as 
contaminated as far as testing the performance is concerned. 


2. Account for data snooping: If you have to use a data set more than
once, keep track of the level of contamination and treat the reliability of
your performance estimates in light of this contamination. The bounds
(1.6) and (2.12) can provide guidelines for the relative reliability of different
data sets that have been used in different roles within the learning
process.


Data snooping versus sampling bias. Sampling bias was defined based 
on how the data was obtained before any learning; data snooping was defined 
based on how the data affected the learning, in particular how the learning 
model is selected. These are obviously different concepts. However, there are 
cases where sampling bias occurs as a consequence of ‘snooping’ — looking at 
data that you are not supposed to look at. Here is an example. 

Consider predicting the performance of different stocks based on historical 
data. In order to see if a prediction rule is any good, you take all currently 
traded companies and test the rule on their stock data over the past 50 years. 
Let us say that you are testing the “buy and hold” strategy, where you would 
have bought the stock 50 years ago and kept it until now. If you test this 
‘hypothesis’, you will get excellent performance in terms of profit. Well, don’t 
get too excited! You inadvertently biased the results in your favor by picking 
only currently traded companies, which means that the companies that did 
not make it are not part of your evaluation. When you put your prediction 
rule to work, it will be used on all companies whether they will survive or 
not, since you cannot identify which companies today will be the ‘currently 
traded’ companies 50 years from now. This is a typical case of sampling bias, 
since the problem is that the training data is not representative of the test 
data. However, if we trace the origin of the bias, we did ‘snoop’ in this case by 
looking at future data of companies to determine which of these companies to 
use in our training. Since we are using information in training that we would 
not have access to in real trading, this is viewed as a form of data snooping. 
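
The size of this effect can be explored with a toy simulation. Everything in the sketch below is an assumption made for illustration (the yearly return range, the failure threshold, and the function name); it is not a model of real markets.

```python
import random


def simulate_buy_and_hold(num_companies, num_years, rng):
    """Simulate hypothetical stocks whose value is multiplied each year by a
    random factor; a company fails if its value ever drops below 0.2.

    Returns (mean final value over all companies,
             mean final value over surviving companies only).
    """
    final_values = []
    survived = []
    for _ in range(num_companies):
        value, alive = 1.0, True
        for _ in range(num_years):
            value *= rng.uniform(0.6, 1.4)  # made-up range, average factor 1
            if value < 0.2:
                value, alive = 0.0, False   # the investment is lost
                break
        final_values.append(value)
        survived.append(alive)
    mean_all = sum(final_values) / num_companies
    survivors = [v for v, ok in zip(final_values, survived) if ok]
    mean_survivors = sum(survivors) / len(survivors) if survivors else float("nan")
    return mean_all, mean_survivors
```

Failed companies contribute zero to the overall average and are left out of the survivors' average, so whenever some companies fail and some survive, the survivors-only figure overstates what a buy-and-hold investor who could not foresee the failures would have earned.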


5.4 Problems


Problem 5.1 The idea of falsifiability (that a claim can be rendered
false by observed data) is an important principle in experimental science.


Axiom of Non-Falsifiability. If the outcome of an experiment
has no chance of falsifying a particular proposition, then the result
of that experiment does not provide evidence one way or another
toward the truth of the proposition.


Consider the proposition “There is $h \in \mathcal{H}$ that approximates $f$ as would be
evidenced by finding such an $h$ with in-sample error zero on $\mathbf{x}_1, \ldots, \mathbf{x}_N$.” We
say that the proposition is falsified if no hypothesis in $\mathcal{H}$ can fit the data
perfectly.


(a) Suppose that $\mathcal{H}$ shatters $\mathbf{x}_1, \ldots, \mathbf{x}_N$. Show that this proposition is not
falsifiable for any $f$.


(b) Suppose that $f$ is random ($f(\mathbf{x}) = \pm 1$ with probability $\frac{1}{2}$, independently
on every $\mathbf{x}$), so $E_{\mathrm{out}}(h) = \frac{1}{2}$ for every $h \in \mathcal{H}$. Show that

$$\mathbb{P}[\text{falsification}] \ge 1 - \frac{m_{\mathcal{H}}(N)}{2^N}.$$


(c) Suppose $d_{\mathrm{VC}} = 10$ and $N = 100$. If you obtain a hypothesis $h$ with zero
$E_{\mathrm{in}}$ on your data, what can you ‘conclude’ from the result in part (b)?


Problem 5.2 Structural Risk Minimization (SRM) is a useful framework
for model selection that is related to Occam's Razor. Define a structure, that is, a
nested sequence of hypothesis sets:

$$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots$$

The SRM framework picks a hypothesis from each $\mathcal{H}_i$ by minimizing $E_{\mathrm{in}}$.
That is, $g_i = \operatorname{argmin}_{h \in \mathcal{H}_i} E_{\mathrm{in}}(h)$. Then, the framework selects the final
hypothesis by minimizing $E_{\mathrm{in}}$ and the model complexity penalty $\Omega$. That is,
$g^* = \operatorname{argmin}_{i=1,2,\ldots} \big(E_{\mathrm{in}}(g_i) + \Omega(\mathcal{H}_i)\big)$. Note that $\Omega(\mathcal{H}_i)$ should be non-decreasing in $i$
because of the nested structure.


(a) Show that the in-sample error $E_{\mathrm{in}}(g_i)$ is non-increasing in $i$.


(b) Assume that the framework finds $g^* \in \mathcal{H}_i$ with probability $p_i$. How does
$p_i$ relate to the complexity of the target function?


(c) Argue that the $p_i$'s are unknown but $p_0 \le p_1 \le p_2 \le \cdots \le 1$.


(d) Suppose $g^* = g_i$. Show that

$$\mathbb{P}\big[\,|E_{\mathrm{in}}(g_i) - E_{\mathrm{out}}(g_i)| > \epsilon \;\big|\; g^* = g_i\,\big] \le \frac{1}{p_i} \cdot 4\, m_{\mathcal{H}_i}(2N)\, e^{-\epsilon^2 N/8}.$$

Here, the conditioning is on selecting $g_i$ as the final hypothesis by SRM.
[Hint: Use the Bayes theorem to decompose the probability and then
apply the VC bound on one of the terms.]


You may interpret this result as follows: if you use SRM and end up with $g_i$,
then the generalization bound is a factor $\frac{1}{p_i}$ worse than the bound you would
have gotten had you simply started with $\mathcal{H}_i$.
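
The selection step of SRM can be sketched in a few lines. The function below is our own illustration; it assumes that the in-sample errors $E_{\mathrm{in}}(g_i)$ and the penalties $\Omega(\mathcal{H}_i)$ have already been computed, and it uses 0-based list positions rather than the problem's index $i$.

```python
def srm_select(in_sample_errors, penalties):
    """Return the list position minimizing E_in(g_i) + Omega(H_i).

    in_sample_errors[k] is E_in of the best hypothesis in the k-th set;
    penalties[k] is the complexity penalty of the k-th set.
    """
    if len(in_sample_errors) != len(penalties):
        raise ValueError("need exactly one penalty per hypothesis set")
    totals = [e + p for e, p in zip(in_sample_errors, penalties)]
    # Ties are resolved in favor of the earlier (simpler) hypothesis set.
    return min(range(len(totals)), key=totals.__getitem__)
```

With hypothetical values `in_sample_errors = [0.30, 0.10, 0.08, 0.08]` and `penalties = [0.02, 0.05, 0.12, 0.30]`, the totals are 0.32, 0.15, 0.20 and 0.38, so the second set is selected even though later sets achieve a lower in-sample error.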


Problem 5.3 In our credit card example, the bank starts with some vague 
idea of what constitutes a good credit risk. So, as customers $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
arrive, the bank applies its vague idea to approve credit cards for some of these 
customers. Then, only those who got credit cards are monitored to see if they 
default or not. 


For simplicity, suppose that the first N customers were given credit cards. 
Now that the bank knows the behavior of these customers, it comes to you 
to improve their algorithm for approving credit. The bank gives you the data 
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$.

Before you look at the data, you do mathematical derivations and come up with 
a credit approval function. You now test it on the data and, to your delight, 
obtain perfect prediction. 


(a) What is M, the size of your hypothesis set? 


(b) With such an M, what does the Hoeffding bound say about the probability 
that the true performance is worse than 2% error for N = 10000? 


(c) You give your g to the bank and assure them that the performance will 
be better than 2% error and your confidence is given by your answer 
to part (b). The bank is thrilled and uses your g to approve credit for 
new clients. To their dismay, more than half their credit cards are being 
defaulted on. Explain the possible reason(s) behind this outcome. 


(d) Is there a way in which the bank could use your credit approval function
to have your probabilistic guarantee? How? [Hint: The answer is yes!]


Problem 5.4 The S&P 500 is a set of the largest 500 companies currently
trading. Suppose there are 10,000 stocks currently trading, and there have been
50,000 stocks which have ever traded over the last 50 years (some of these have
gone bankrupt and stopped trading). We wish to evaluate the profitability of 
various ‘buy and hold’ strategies using these 50 years of data (roughly 12,500 
trading days). 

Since it is not easy to get stock data, we will confine our analysis to today’s 
S&P 500 stocks, for which the data is readily available. 


(a) A stock is profitable if it went up on more than 50% of the days. Of your 
S&P stocks, the most profitable went up on 52% of the days ($E_{\mathrm{in}} = 0.48$).


(i) Since we picked the best among 500, using the Hoeffding bound,

$$\mathbb{P}[\,|E_{\mathrm{in}} - E_{\mathrm{out}}| > 0.02\,] \le 2 \times 500 \times e^{-2 \times 12500 \times 0.02^2} \approx 0.045.$$


There is a greater than 95% chance this stock is profitable. Where 
did we go wrong? 

(ii) Give a better estimate for the probability that this stock is profitable. 
[Hint: What should the correct M be in the Hoeffding bound?] 


(b) We wish to evaluate the profitability of ‘buy and hold’ for general stock 
trading. We notice that all of our 500 S&P stocks went up on at least 51% 
of the days. 

(i) We conclude that buying and holding a stock is a good strategy for
general stock trading. Where did we go wrong? 
(ii) Can we say anything about the performance of buy and hold trading? 


Problem 5.5 You think that the stock market exhibits reversal, so if 
the price of a stock sharply drops you expect it to rise shortly thereafter. If it 
sharply rises, you expect it to drop shortly thereafter. 


To test this hypothesis, you build a trading strategy that buys when the stocks 
go down and sells in the opposite case. You collect historical data on the current 
S&P 500 stocks, and your hypothesis gave a good annual return of 12%. 


(a) When you trade using this system, do you expect it to perform at this 
level? Why or why not? 


(b) How can you test your strategy so that its performance in sample is more 
reflective of what you should expect in reality? 


Problem 5.6 One often hears “Extrapolation is harder than interpolation.” 
Give a possible explanation for this phenomenon using the principles in this 
chapter. [Hint: training distribution versus testing distribution.]


Epilogue


This book set the stage for a deeper exploration into Learning From Data by 
developing the foundations. It is possible to learn from data, and you have 
all the basic tools to do so. The linear model coupled with the right features 
and an appropriate nonlinear transform, together with the right amount of 
regularization, pretty much puts you into the thick of the game, and you will 
be in good stead as long as you keep in mind the three basic principles: simple 
is better (Occam’s razor), avoid data snooping and beware of sampling bias. 

Where to go from here? There are two main directions. One is to learn 
more sophisticated learning techniques, and the other is to explore different 
learning paradigms. Let us preview these two directions to give the reader a 
better understanding of the ‘map’ of learning from data. 

The linear model can be used as a building block for other popular tech- 
niques. A cascade of linear models, mostly with soft thresholds, creates a 
neural network. A robust algorithm for linear models, based on quadratic 
programming, creates support vector machines. An efficient approach to non- 
linear transformation in support vector machines creates kernel methods. A 
combination of different models in a principled way creates boosting and en- 
semble learning. There are other successful models and techniques, and more 
to come for sure. 

In terms of other paradigms, we have briefly mentioned unsupervised learn- 
ing and reinforcement learning. There is a wealth of techniques for these learn- 
ing paradigms, including methods that mix labeled and unlabeled data. Active 
learning and online learning, which we also mentioned briefly, have their own 
techniques and theories. In addition, there is a school of thought that treats 
learning as a completely probabilistic paradigm using a Bayesian approach, 
and there are useful probabilistic techniques such as Gaussian processes. Last 
but not least, there is a school that treats learning as a branch of the theory 
of computational complexity, with emphasis on asymptotic results. 

Of course, the ultimate test of any engineering discipline is its impact in 
real life. There is no shortage of successful applications of learning from data. 
Some of the application domains have specialized techniques that are worth 
exploring, e.g., computational finance and recommender systems. 

Learning from data is a very dynamic field. Some of the hot techniques 
and theories at times become just fads, and others gain traction and become
part of the field. What we have emphasized in this book are the necessary
fundamentals that give any student of learning from data a solid foundation,
and enable him or her to venture out and explore further techniques and
theories, or perhaps to contribute their own.


Appendix


Proof of the VC Bound 


In this Appendix, we present the formal proof of Theorem 2.5. It is a fairly 
elaborate proof, and you may skip it altogether and just take the theorem for 
granted, but you won’t know what you are missing ☺!


Theorem A.1 (Vapnik, Chervonenkis, 1971). 


P | sup |Ein(h) — Eout(h)| > €| < 4may(2N)e~ 8°, 
heH 


This inequality is called the VC Inequality, and it implies the VC bound of
Theorem 2.5. The inequality is valid for any target function (deterministic
or probabilistic) and any input distribution. The probability is over data
sets of size N. Each data set is generated iid (independent and identically
distributed), with each data point generated independently according to the
joint distribution P(x, y). The event
$\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon$ is equivalent to the union
over all $h \in \mathcal{H}$ of the events $|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon$; this union
contains the event that involves g in Theorem 2.5. The use of the supremum (a
technical version of the maximum) is necessary since $\mathcal{H}$ can have a
continuum of hypotheses.
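To get a feel for the right-hand side of the VC Inequality, the following minimal sketch evaluates $4\,m_{\mathcal{H}}(2N)\,e^{-\epsilon^2 N/8}$ numerically. It assumes the polynomial bound $m_{\mathcal{H}}(N) \le N^{d_{\text{vc}}} + 1$ as a stand-in for the growth function. The function name and the choices $d_{\text{vc}} = 3$ and $\epsilon = 0.1$ are illustrative assumptions, not part of the proof.

```python
import math

def vc_inequality_rhs(N, eps, d_vc):
    """Evaluate 4 * m_H(2N) * exp(-eps^2 * N / 8).

    Assumption: the growth function is replaced by the polynomial
    bound m_H(N) <= N**d_vc + 1, so the result is an upper estimate.
    """
    growth = (2 * N) ** d_vc + 1          # assumed bound on m_H(2N)
    return 4.0 * growth * math.exp(-(eps ** 2) * N / 8.0)

# The bound only says something once it drops below 1.
for N in (1_000, 10_000, 100_000):
    rhs = vc_inequality_rhs(N, eps=0.1, d_vc=3)
    print(f"N={N:>7}: bound={rhs:.3e}  informative={rhs < 1}")
```

Under these assumed values the polynomial factor dominates for small N, and the exponential only takes over once N is large.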

The main challenge to proving this theorem is that $E_{\text{out}}(h)$ is difficult to
manipulate compared to $E_{\text{in}}(h)$, because $E_{\text{out}}(h)$ depends on the entire
input space rather than just a finite set of points. The main insight needed
to overcome this difficulty is the observation that we can get rid of
$E_{\text{out}}(h)$ altogether because the deviations between $E_{\text{in}}$ and $E_{\text{out}}$ can be
essentially captured by deviations between two in-sample errors: $E_{\text{in}}$ (the
original in-sample error) and the in-sample error on a second independent
data set (Lemma A.2). We have seen this idea many times before when we use a
test or validation set to estimate $E_{\text{out}}$. This insight results in two main
simplifications:


1. The supremum of the deviations over infinitely many $h \in \mathcal{H}$ can be
reduced to considering only the dichotomies implementable by $\mathcal{H}$ on the
two independent data sets. That is where the growth function $m_{\mathcal{H}}(2N)$
enters the picture (Lemma A.3).


2. The deviation between two independent in-sample errors is ‘easy’ to
analyze compared to the deviation between $E_{\text{in}}$ and $E_{\text{out}}$ (Lemma A.4).


The combination of Lemmas A.2, A.3 and A.4 proves Theorem A.1. 


A.1 Relating Generalization Error to In-Sample 
Deviations 


Let’s introduce a second data set D', which is independent of D, but sampled
according to the same distribution P(x, y). This second data set is called a
ghost data set because it doesn’t really exist; it is just a tool used in the
analysis. We hope to bound the term $\mathbb{P}[|E_{\text{in}} - E_{\text{out}}| \text{ is large}]$ by another
term $\mathbb{P}[|E_{\text{in}} - E'_{\text{in}}| \text{ is large}]$, which is easier to analyze.

The intuition behind the formal proof is as follows. For any single
hypothesis h, because D' is fresh, sampled independently from P(x, y), the
Hoeffding Inequality guarantees that $E'_{\text{in}}(h) \approx E_{\text{out}}(h)$ with a high
probability. That is, when $|E_{\text{in}}(h) - E_{\text{out}}(h)|$ is large, with a high
probability $|E_{\text{in}}(h) - E'_{\text{in}}(h)|$ is also large. Therefore,
$\mathbb{P}[|E_{\text{in}}(h) - E_{\text{out}}(h)| \text{ is large}]$ can be approximately bounded by
$\mathbb{P}[|E_{\text{in}}(h) - E'_{\text{in}}(h)| \text{ is large}]$.

We are trying to bound the probability that $E_{\text{in}}$ is far from $E_{\text{out}}$. Let
$E'_{\text{in}}(h)$ be the ‘in-sample’ error for hypothesis h on D'. Suppose that
$E_{\text{in}}$ is far from $E_{\text{out}}$ with some probability (and similarly $E'_{\text{in}}$ is far
from $E_{\text{out}}$, with that same probability, since $E_{\text{in}}$ and $E'_{\text{in}}$ are
identically distributed). When N is large, the probability is roughly
Gaussian around $E_{\text{out}}$, as illustrated in the figure. The red region
represents the cases when $E_{\text{in}}$ is far from $E_{\text{out}}$. In those cases, $E'_{\text{in}}$ is
far from $E_{\text{in}}$ about half the time, as illustrated by the green region. That
is, $\mathbb{P}[|E_{\text{in}} - E_{\text{out}}| \text{ is large}]$ can be approximately bounded by
$2\,\mathbb{P}[|E_{\text{in}} - E'_{\text{in}}| \text{ is large}]$.

This argument provides some intuition that the deviations between $E_{\text{in}}$
and $E_{\text{out}}$ can be captured by the deviations between $E_{\text{in}}$ and $E'_{\text{in}}$. The
argument can be carefully extended to multiple hypotheses.


[Figure: probability distribution of $E_{\text{in}}$, centered at $E_{\text{out}}$.]

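Lemma A.2.

$$
\left(1 - 2e^{-\frac{1}{2}\epsilon^2 N}\right)\, \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \;\le\; \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right],
$$

where the probability on the RHS is over D and D' jointly.

Proof. We can assume that
$\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] > 0$, otherwise there is
nothing to prove.

$$
\begin{aligned}
&\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right] \\
&\qquad\ge\; \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\text{ and }\; \sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \qquad \text{(A.1)} \\
&\qquad=\; \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \times \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; \sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right].
\end{aligned}
$$

Inequality (A.1) follows because $\mathbb{P}[B_1] \ge \mathbb{P}[B_1 \text{ and } B_2]$ for any two
events $B_1$, $B_2$. Now, let’s consider the last term:

$$
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; \sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right].
$$

The event on which we are conditioning is a set of data sets with non-zero
probability. Fix a data set D in this event. Let $h^*$ be any hypothesis for
which $|E_{\text{in}}(h^*) - E_{\text{out}}(h^*)| > \epsilon$. One such hypothesis must exist given
that D is in the event on which we are conditioning. The hypothesis $h^*$ does
not depend on D', but it does depend on D.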


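$$
\begin{aligned}
&\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; \sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \\
&\qquad\ge\; \mathbb{P}\left[|E_{\text{in}}(h^*) - E'_{\text{in}}(h^*)| > \frac{\epsilon}{2} \;\middle|\; \sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \qquad \text{(A.2)} \\
&\qquad\ge\; \mathbb{P}\left[|E'_{\text{in}}(h^*) - E_{\text{out}}(h^*)| \le \frac{\epsilon}{2} \;\middle|\; \sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \qquad \text{(A.3)} \\
&\qquad\ge\; 1 - 2e^{-\frac{1}{2}\epsilon^2 N}. \qquad \text{(A.4)}
\end{aligned}
$$

1. Inequality (A.2) follows because the event “$|E_{\text{in}}(h^*) - E'_{\text{in}}(h^*)| > \frac{\epsilon}{2}$”
implies “$\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}$”.

2. Inequality (A.3) follows because the events “$|E'_{\text{in}}(h^*) - E_{\text{out}}(h^*)| \le \frac{\epsilon}{2}$”
and “$|E_{\text{in}}(h^*) - E_{\text{out}}(h^*)| > \epsilon$” (which is given) imply, by the triangle
inequality, “$|E_{\text{in}}(h^*) - E'_{\text{in}}(h^*)| > \frac{\epsilon}{2}$”.

3. Inequality (A.4) follows because $h^*$ is fixed with respect to D' and so we
can apply the Hoeffding Inequality to $\mathbb{P}\left[|E'_{\text{in}}(h^*) - E_{\text{out}}(h^*)| \le \frac{\epsilon}{2}\right]$.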


Notice that the Hoeffding Inequality applies to
$\mathbb{P}\left[|E'_{\text{in}}(h^*) - E_{\text{out}}(h^*)| \le \frac{\epsilon}{2}\right]$ for any $h^*$, as long as $h^*$ is fixed
with respect to D'. Therefore, it also applies to any weighted average of
$\mathbb{P}\left[|E'_{\text{in}}(h^*) - E_{\text{out}}(h^*)| \le \frac{\epsilon}{2}\right]$ based on $h^*$. Finally, since $h^*$
depends on a particular D, we take the weighted average over all D in the
event

$$
\text{“}\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\text{”}
$$

on which we are conditioning, where the weight comes from the probability of
the particular D. Since the bound holds for every D in this event, it holds for
the weighted average. ∎


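Note that we can assume $e^{-\frac{1}{2}\epsilon^2 N} < \frac{1}{4}$, because otherwise the bound in
Theorem A.1 is trivially true. In this case, $1 - 2e^{-\frac{1}{2}\epsilon^2 N} > \frac{1}{2}$, so the
lemma implies

$$
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \;\le\; 2\,\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right].
$$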
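The claim that the bound of Theorem A.1 is trivially true in the excluded case can be checked directly. The short derivation below is a sketch that assumes only $m_{\mathcal{H}}(2N) \ge 1$, which holds because any non-empty hypothesis set realizes at least one dichotomy.

$$
e^{-\frac{1}{2}\epsilon^2 N} \ge \frac{1}{4}
\;\Longrightarrow\;
e^{-\frac{1}{8}\epsilon^2 N} = \left(e^{-\frac{1}{2}\epsilon^2 N}\right)^{1/4} \ge \left(\frac{1}{4}\right)^{1/4} = \frac{1}{\sqrt{2}}
\;\Longrightarrow\;
4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^2 N} \ge \frac{4}{\sqrt{2}} = 2\sqrt{2} > 1 .
$$

A probability never exceeds 1, so in that case the inequality of Theorem A.1 holds with nothing to prove.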


A.2 Bounding Worst Case Deviation Using the 
Growth Function 


Now that we have related the generalization error to the deviations between 
in-sample errors, we can actually work with H restricted to two data sets of 
size N each, rather than the infinite H. Specifically, we want to bound 


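$$
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right],
$$

where the probability is over the joint distribution of the data sets D and D'.
One equivalent way of sampling two data sets D and D' is to first sample a
data set S of size 2N, then randomly partition S into D and D'. This amounts
to randomly sampling, without replacement, N examples from S for D, leaving
the remaining for D'. Given the joint data set S, let

$$
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right]
$$

be the probability of deviation between the two in-sample errors, where the
probability is taken over the random partitions of S into D and D'. By the
law of total probability (with $\sum$ denoting sum or integral as the case may be),

$$
\begin{aligned}
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right]
&= \sum_{S} \mathbb{P}[S] \times \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right] \\
&\le \sup_{S}\, \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right].
\end{aligned}
$$

Let $\mathcal{H}(S)$ be the dichotomies that $\mathcal{H}$ can implement on the points in S. By
definition of the growth function, $\mathcal{H}(S)$ cannot have more than $m_{\mathcal{H}}(2N)$
dichotomies. Suppose it has $M \le m_{\mathcal{H}}(2N)$ dichotomies, realized by
$h_1, \ldots, h_M$. Thus,

$$
\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| = \sup_{h\in\{h_1,\ldots,h_M\}} |E_{\text{in}}(h) - E'_{\text{in}}(h)|.
$$

Then,

$$
\begin{aligned}
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right]
&= \mathbb{P}\left[\sup_{h\in\{h_1,\ldots,h_M\}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right] \\
&\le \sum_{m=1}^{M} \mathbb{P}\left[|E_{\text{in}}(h_m) - E'_{\text{in}}(h_m)| > \frac{\epsilon}{2} \;\middle|\; S\right] \qquad \text{(A.5)} \\
&\le M \times \sup_{h\in\mathcal{H}} \mathbb{P}\left[|E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right], \qquad \text{(A.6)}
\end{aligned}
$$

where we use the union bound in (A.5), and overestimate each term by the
supremum over all possible hypotheses to get (A.6). After using $M \le m_{\mathcal{H}}(2N)$
and taking the sup operation over S, we have proved: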


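Lemma A.3.

$$
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right]
\;\le\; m_{\mathcal{H}}(2N) \times \sup_{S}\, \sup_{h\in\mathcal{H}}\, \mathbb{P}\left[|E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right],
$$

where the probability on the LHS is over D and D' jointly, and the probability
on the RHS is over random partitions of S into two sets D and D'.


The main achievement of Lemma A.3 is that we have pulled the supremum
over $h \in \mathcal{H}$ outside the probability, at the expense of the extra factor
of $m_{\mathcal{H}}(2N)$.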
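The reduction from an infinite hypothesis set to finitely many dichotomies on S can be made concrete with a toy hypothesis set. The sketch below assumes positive rays, h(x) = +1 if x > a and -1 otherwise, on one-dimensional inputs, for which the growth function is N + 1. The function name and the sample `S` are hypothetical.

```python
def positive_ray_dichotomies(points):
    """Distinct +1/-1 patterns that h(x) = +1 if x > a else -1 produces
    on `points` as the threshold a ranges over the whole real line."""
    xs = sorted(points)
    # One representative threshold per gap: left of all points, between
    # neighbours, and right of all points. No other a yields a new pattern.
    thresholds = [xs[0] - 1.0]
    thresholds += [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

S = [0.3, -1.2, 2.5, 0.9, 1.7, -0.4]       # hypothetical joint sample, 2N = 6
dichotomies = positive_ray_dichotomies(S)
print(len(dichotomies), "dichotomies on", len(S), "points")
```

Although a ranges over a continuum, only finitely many hypotheses behave differently on S, so the supremum over the hypothesis set becomes a maximum over these M representatives. That is what allows the union bound in (A.5) to be applied.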


A.3 Bounding the Deviation between In-Sample 
Errors 


We now address the purely combinatorial problem of bounding

$$
\sup_{S}\, \sup_{h\in\mathcal{H}}\, \mathbb{P}\left[|E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right],
$$

which appears in Lemma A.3. We will prove the following lemma. Then,
Theorem A.1 can be proved by combining Lemmas A.2, A.3 and A.4 taking
$1 - 2e^{-\frac{1}{2}\epsilon^2 N} \ge \frac{1}{2}$ (the only case we need to consider).

Lemma A.4. For any h and any S,

$$
\mathbb{P}\left[|E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right] \;\le\; 2e^{-\frac{1}{8}\epsilon^2 N},
$$

where the probability is over random partitions of S into two sets D and D'.
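For a very small S, the probability in Lemma A.4 can be computed exactly by enumerating every partition of S into D and D', which gives a numerical sanity check of the statement. The sketch below is illustrative only: the function name and the error pattern `errors` (1 where h misclassifies a point of S) are hypothetical.

```python
import math
from itertools import combinations

def partition_deviation_prob(errors, eps):
    """Exact P[|E_in - E'_in| > eps/2 | S] over all partitions of S into
    D and D' of equal size; errors[n] is 1 if h misclassifies point n."""
    n = len(errors) // 2
    total = sum(errors)
    hits = count = 0
    for idx in combinations(range(2 * n), n):   # which points go to D
        in_d = sum(errors[i] for i in idx)       # errors that land in D
        e_in, e_ghost = in_d / n, (total - in_d) / n
        hits += abs(e_in - e_ghost) > eps / 2
        count += 1
    return hits / count

errors = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0]   # 2N = 16
N = len(errors) // 2
for eps in (0.5, 1.0, 1.5):
    exact = partition_deviation_prob(errors, eps)
    bound = 2 * math.exp(-eps ** 2 * N / 8)
    print(f"eps={eps}: exact={exact:.4f}  bound={bound:.4f}")
```

The enumeration visits all $\binom{2N}{N}$ partitions, so it is practical only for tiny N. The lemma itself holds for every N.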


Proof. To prove the result, we will use another inequality, also due to
Hoeffding, for sampling without replacement:


Lemma A.5 (Hoeffding, 1963). Let $A = \{a_1, \ldots, a_{2N}\}$ be a set of values with
$a_n \in [0, 1]$, and let $\mu = \frac{1}{2N}\sum_{n=1}^{2N} a_n$ be their mean. Let
$D = \{z_1, \ldots, z_N\}$ be a sample of size N, sampled from A uniformly without
replacement. Then

$$
\mathbb{P}\left[\left|\frac{1}{N}\sum_{n=1}^{N} z_n - \mu\right| > \epsilon\right] \;\le\; 2e^{-2\epsilon^2 N}.
$$

We apply Lemma A.5 as follows. For the 2N examples in S, let $a_n = 1$ if
$h(\mathbf{x}_n) \ne y_n$ and $a_n = 0$ otherwise. The $\{a_n\}$ are the errors made by h on S.
Now randomly partition S into D and D', i.e., sample N examples from S
without replacement to get D, leaving the remaining N examples for D'. This
results in a sample of size N of the $\{a_n\}$ for D, sampled uniformly without
replacement. Note that

$$
E_{\text{in}}(h) = \frac{1}{N}\sum_{a_n \in D} a_n, \qquad\text{and}\qquad E'_{\text{in}}(h) = \frac{1}{N}\sum_{a'_n \in D'} a'_n .
$$

Since we are sampling without replacement, $S = D \cup D'$ and $D \cap D' = \emptyset$, and
so

$$
\mu = \frac{1}{2N}\sum_{n=1}^{2N} a_n = \frac{E_{\text{in}}(h) + E'_{\text{in}}(h)}{2}.
$$

It follows that $|E_{\text{in}} - \mu| > t \iff |E_{\text{in}} - E'_{\text{in}}| > 2t$. By Lemma A.5,

$$
\mathbb{P}\left[|E_{\text{in}}(h) - E'_{\text{in}}(h)| > 2t\right] \;\le\; 2e^{-2t^2 N}.
$$

Substituting $t = \frac{\epsilon}{4}$ gives the result. ∎
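With all three lemmas in hand, the combination that yields Theorem A.1 can be written out in one chain. This is a sketch that assembles the statements of Lemmas A.2, A.3 and A.4 in the case $1 - 2e^{-\frac{1}{2}\epsilon^2 N} \ge \frac{1}{2}$; in the remaining case the theorem holds trivially.

$$
\begin{aligned}
\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right]
&\le 2\,\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2}\right] && \text{(Lemma A.2)} \\
&\le 2\, m_{\mathcal{H}}(2N)\, \sup_{S}\sup_{h\in\mathcal{H}} \mathbb{P}\left[|E_{\text{in}}(h) - E'_{\text{in}}(h)| > \frac{\epsilon}{2} \;\middle|\; S\right] && \text{(Lemma A.3)} \\
&\le 2\, m_{\mathcal{H}}(2N) \cdot 2e^{-\frac{1}{8}\epsilon^2 N} && \text{(Lemma A.4)} \\
&= 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^2 N}.
\end{aligned}
$$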
Notation


event (in probability) 

set 

absolute value of a number, or cardinality (number of ele- 
ments) of a set, or determinant of a matrix 

square of the norm; sum of the squared components of a 
vector 

floor; largest integer which is not larger than the argument 
the interval of real numbers from a to b 

evaluates to 1 if argument is true, and to 0 if it is false 
gradient operator, e.g., $\nabla E_{\text{in}}$ (gradient of $E_{\text{in}}(\mathbf{w})$ with respect
to $\mathbf{w}$)

inverse 

pseudo-inverse 

transpose (columns become rows and vice versa) 


number of ways to choose k objects from N distinct objects
(equals $\frac{N!}{(N-k)!\,k!}$ where ‘!’ is the factorial)

the set A with the elements from set B removed 

zero vector; a column vector whose components are all zeros 
d-dimensional Euclidean space with an added ‘zeroth coor- 
dinate’ fixed to 1 

tolerance in approximating a target. 

bound on the probability of exceeding $\epsilon$ (the approximation
tolerance)

learning rate (step size in iterative learning, e.g., in stochas- 
tic gradient descent) 

regularization parameter 

regularization parameter corresponding to weight budget 
C 

penalty for model complexity; either a bound on general- 
ization error, or a regularization term 

logistic function $\theta(s) = e^{s}/(1 + e^{s})$

feature transform, $\mathbf{z} = \Phi(\mathbf{x})$

Qth-order polynomial transform

a coordinate in the feature transform $\Phi$, $z_i = \phi_i(\mathbf{x})$
probability of a binary outcome 

fraction of a binary outcome in a sample 

variance of noise 

learning algorithm 

the value of a at which the minimum of the argument is 
achieved 

an event (in probability), usually ‘bad’ event 

the bias term in a linear combination of inputs, also called $w_0$

the bias term in bias-variance decomposition 

maximum number of dichotomies on N points with a break 
point k 

bound on the size of weights in the soft order constraint 
dimensionality of the input space $\mathcal{X} = \mathbb{R}^d$ or $\mathcal{X} = \{1\} \times \mathbb{R}^d$
dimensionality of the transformed space Z 

VC dimension of hypothesis set H. 

data set $\mathcal{D} = (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$; technically not a set,
but a vector of elements $(\mathbf{x}_n, y_n)$. $\mathcal{D}$ is often the training
set, but sometimes split into training and validation/test
sets.

subset of D used for training when a validation or test set 
is used. 

validation set; subset of D used for validation. 

error measure between hypothesis h and target function f 
exponent of x in the natural base $e = 2.71828\cdots$
pointwise version of $E(h, f)$, e.g., $(h(\mathbf{x}) - f(\mathbf{x}))^2$
leave-one-out error on example n when this nth example is 
excluded in training [cross validation] 

expected value of argument 

expected value with respect to x 

expected value of y given x 

augmented error (in-sample error plus regularization term) 
in-sample error (training error) for hypothesis h 

cross validation error 

out-of-sample error for hypothesis h 

out-of-sample error when D is used for training 

expected out-of-sample error 

validation error 

test error 

target function, $f: \mathcal{X} \to \mathcal{Y}$

final hypothesis $g \in \mathcal{H}$ selected by the learning algorithm;
$g: \mathcal{X} \to \mathcal{Y}$

final hypothesis when the training set is D 

average final hypothesis [bias-variance analysis]
final hypothesis when trained using D minus some points
gradient, e.g., $\mathbf{g} = \nabla E_{\text{in}}$

a hypothesis $h \in \mathcal{H}$; $h: \mathcal{X} \to \mathcal{Y}$

a hypothesis in transformed space Z 

hypothesis set 

hypothesis set that corresponds to perceptrons in $\Phi$-transformed
space

restricted hypothesis set by weight budget C [soft order
constraint]

dichotomies (patterns of $\pm 1$) generated by $\mathcal{H}$ on the points
$\mathbf{x}_1, \cdots, \mathbf{x}_N$

The hat matrix [linear regression] 

identity matrix; square matrix whose diagonal elements are 
1 and off-diagonal elements are 0 

size of validation set 

qth-order Legendre polynomial 

logarithm in base e 

logarithm in base 2 

number of hypotheses 

the growth function; maximum number of dichotomies gen- 
erated by H on any N points 

maximum of the two arguments 

number of examples (size of D) 

absolute value of this term is asymptotically negligible com- 
pared to the argument 

absolute value of this term is asymptotically smaller than 
a constant multiple of the argument 

(marginal) probability or probability density of x 
conditional probability or probability density of y given x 
joint probability or probability density of x and y 
probability of an event 

order of polynomial transform 

complexity of f (order of polynomial defining f) 

the set of real numbers 

d-dimensional Euclidean space 

signal $s = \mathbf{w}^{\mathrm{T}}\mathbf{x} = \sum_i w_i x_i$ (i goes from 0 to d or 1 to d
depending on whether x has the $x_0 = 1$ coordinate or not)
sign function, returning +1 for positive and -1 for negative
supremum, smallest value that is $\ge$ the argument for all a
number of iterations, number of epochs 

iteration number or epoch number 

hyperbolic tangent function; $\tanh(s) = (e^{s} - e^{-s})/(e^{s} + e^{-s})$
trace of square matrix (sum of diagonal elements)

number of subsets in V-fold cross validation ($V \times K = N$)
direction in gradient descent (not necessarily a unit vector)

unit vector version of v [gradient descent] 

the variance term in bias-variance decomposition 

weight vector (column vector) 

weight vector in transformed space Z 

selected weight vector [pocket algorithm] 

weight vector that separates the data 

solution weight vector to linear regression 

regularized solution to linear regression with weight decay 
solution weight vector of perceptron learning algorithm 
added coordinate in weight vector w to represent bias b 
the input $\mathbf{x} \in \mathcal{X}$. Often a column vector $\mathbf{x} \in \mathbb{R}^d$ or
$\mathbf{x} \in \{1\} \times \mathbb{R}^d$. x is used if input is scalar.

added coordinate to x, fixed at $x_0 = 1$ to absorb the bias
term in linear expressions

input space whose elements are $\mathbf{x} \in \mathcal{X}$

matrix whose rows are the data inputs $\mathbf{x}_n$ [linear regression]
exclusive OR function (returns 1 if the number of 1’s in its
input is odd)

the output $y \in \mathcal{Y}$

column vector whose components are the data set outputs
$y_n$ [linear regression]

estimate of y [linear regression]

output space whose elements are $y \in \mathcal{Y}$

transformed input space whose elements are $\mathbf{z} = \Phi(\mathbf{x})$
matrix whose rows are the transformed inputs $\mathbf{z}_n = \Phi(\mathbf{x}_n)$
[linear regression]

Index


active learning, 181
definition, 12
Adaline, 35, 110
approximation, 27
versus generalization, 62-68, 106
artificial intelligence, 5
augmented error, 132, 157
axiom of non-falsifiability, 178


B(N, k)
definition, 46
lower bound, 69
upper bound, 48
backgammon, 12
Bayes optimal decision theory, 10
Bayes theorem, 33
Bayesian learning, 181
bias-variance, 62-66
average function, 63
dependence on N, d, 158
example, 65
impact of noise, 125
linear models, 158-159
linear regression, 114
noisy target, 74
bin model, 18
multiple bins, 22
relationship to learning, 20
binomial distribution, 36
boosting, 181
break point
definition, 45


Chebyshev inequality, 36
Chernoff bound, 37
classification
for regression, 113
linear programming algorithm, 110
classification error
bound by cross-entropy error, 97
bound by squared error, 97
clustering, 13
coin classification, 9, 13
combinatorial optimization, 80
complexity
of H, 26
of f, 27
computational complexity, 181
computational finance, 181
computer vision, 1
convex function, 93
convex set, 44
cost, 28
cost matrix, 29, 115
credit approval, 3, 82, 96
cross validation, 145-150
V-fold, 150
choosing $\lambda$, 149
digits data, 151
effective number of examples, 163
exact computation, 149
leave-one-out, 146
linear model, 149
linear model, analytic, 164
model selection, 148
regularized, 165
summary, 147
unbiased, 147
variance, 162
cross-entropy, 92


data contamination, 145, 151, 176
data mining, 15
data point, 3
data set, 3
ghost, 188
space of, 54
data snooping, 173-177, 181
financial trading, 174
nonlinear transform, 103 
normalization bias, 174 
versus sampling bias, 177 
decision stump, 106 
design 
versus learning, 9 
deterministic noise, 124, 128 
effect on learning, 151 
regularization, 136 
similarity to stochastic noise, 136 
Dewey, 171 
dichotomy, 42 
maximum number, 46 
perceptron, 43 


table, 47 
differentiable, 85 
twice-, 93, 95 


effective number of hypotheses, 41, 53 
effective number of parameters, 52, 137, 
159 
Einstein, 167 
ensemble learning, 181 
entropy, 168 
error measure, 28-30 
$L_1$ versus $L_2$, 38
classification, 28 
cross-entropy, 92 
fingerprint example, 28 
logistic regression, 91 
example, 3 


false accept, 29, 115 
false reject, 29, 115 
falsifiability, 178 
feasibility of learning 
Boolean example, 16 
probabilistic, 18 
two main questions, 26 
visual example, 15 
feature selection, 151 
feature space, 100 
features, 81 
nonlinear transform, 99 
feature transform, 100, 111, 116-117 
final exam, 39 
financial forecasting, 1 
fingerprint verification, 28, 115 


football scam, 170 


Gaussian processes, 181 
generalization, 39—59 
VC bound, 50-59 
VC dimension, 50 
generalization bound 
definition, 40 
Devroye, 73 
Parrondo and Van den Broek, 73 
Rademacher penalty, 73 
relative error, 74 
VC, see VC generalization bound 
generalization error 
definition, 40 
global minimum, 93 
gradient descent, 92-99 
algorithm, 95 
batch, 97 
initialization and termination, 95 
stochastic, 97 
growth function, 41-50 
2-dimensional perceptron, 43 
bound, 46-49 
convex set, 44 
definition, 42 
in VC proof, 190 
polynomial bound, 50 
positive interval, 44 
positive ray, 43 
two-dimensional perceptron, 43 


handwritten digit recognition, 4, 11, 81- 
82, 106-107, 151 
hat matrix, 87, 112 
Hessian matrix, 116 
Hoeffding bound, see Hoeffding Inequal- 
ity 
Hoeffding Inequality, 19, 19-27 
and binomial distribution, 36 
uniform version, 24 
without replacement, 192 
hypothesis set, 3 
composition, 72 
concentric spheres, 69 
convex set, 44 
monotonic, 71 
polynomial, 120 
positive interval, 44
positive ray, 43

positive rectangles, 69 
positive-negative interval, 69 
positive-negative ray, 69 
restricted to inputs, 42 


in-sample error, 21 
input space, 3 
iterative learning, 7 


kernel methods, 181 


Lagrange multiplier, 131, 157 
lasso, 161 
law of large numbers, 36, 37 
learning 
criteria, 26, 78 
feasibility, 15-18, 24-26 
learning algorithm, 3 
learning curve, 66-68, 140, 147 
linear regression, 88 
learning model 
definition, 5 
learning problem 
summary figure, 30 
learning rate, 94, 95 
leave-one-out, 146 
Legendre polynomials, 123, 128-129, 154, 
155 
likelihood, 91 
linear classification, 77 
linear model, 77 
bias-variance, 158-159 
building block, 181 
cross validation, analytic, 164 
optimal weight decay, 161 
overlooked resource, 107 
summary, 96 
linear programming, 110, 111 
linear regression, 82-88, 111 
algorithm, 86 
bias and variance, 114 
for classification, 96-97, 109-110 
learning curve, 88 
optimal hypothesis, 111 
out of sample, 87—88 
out-of-sample error, 112 
projection matrix, 86, 113 
rank deficient, 114 


using classification algorithm, 113 

linearly separable, 6, 78 
example, 6 

local minimum, 93 

logistic function, 89 

logistic regression, 88-99 
algorithm, 95 
cross-entropy error, 92 
error measure, 91-92 
for classification, 96-97, 115 
hard threshold, 115 
initialization, 95 
optimal decision theory, 115 
termination, 96 

loss matrix, 38 


machine learning, vii, 14 
maximum likelihood, 91 
medical diagnosis, 1 
minimum description length, 168 
model selection, 141-145 
choosing $\lambda$, 134, 149
cross validation, 148 
experiment, 144 
summary, 143 
monotonic functions, 71 
VC dimension, 71 
movie rating, 1-3 
multiclass, 81 


Netflix, 1 
neural network, 181 
Newton’s method, 116 
noise 

deterministic, 124 

stochastic, 124 
non-falsifiability, 178 

axiom, 170 

picking financial traders, 170 
non-separable data, 79-81 
nonlinear regression, 104 
nonlinear transformation, 99 
normalization, 175 
NP-hard, 80 


objective, 28 

Occam’s razor, 167—171, 181 
off training set error, 37 

$\Omega$, 58
online learning, 98, 181
definition, 12 

ordinary least squares, 86 

out-of-sample error, 21 

outliers, 79 

output space, 3 

overfitting, 119—165, 171 
definition, 119 
experiment, 123, 155 
learning curves, 122 


pattern recognition, 9 
penalty 
hypothesis complexity, 126, 133 
model complexity, 58 
perceptron, 5-8, 78-82 
definition, 5 
experiment, 34 
learning algorithm (PLA), 7 
$m_{\mathcal{H}}(N)$, 70
PLA convergence, 33 
pocket algorithm, 80 
perceptron learning algorithm, 7, 77, 78, 
98, 109-110 
and SGD, 98 
convergence, 33 
figure, 7, 83 
PLA, see perceptron learning algorithm 
pocket algorithm, 80, 97, 109 
figure, 83 
poll, 19 
Truman versus Dewey, 171 
polynomial transform, 104 
polynomials, 120 
positive interval, 44 
positive ray, 43 
postal scam, 170 
prediction of heart attacks, 89 
probability 
logistic regression, 89 
union bound, 24, 41 
projection matrix, 113 
pseudo-inverse, 85 
numerical stability, 86 
publication bias, 173 


quadratic programming, 181 


random sample, 19 


recommender systems, 1, 15, 181 
regression, 77, 82 
logistic, 89 
regularization, 126—137, 181 
$E_{\text{in}}$ versus $\lambda$, 156
augmented error, 132
choosing $\lambda$, 134, 149
input noise, 160 
lasso, 161 
linear model, 133 
ridge regression, 132 
soft order constraint, 128 
Tikhonov, 131, 160 
VC dimension, 137 
weight decay, 132 
regularization parameter, $\lambda$, 133
reinforcement learning, 12, 181 
ridge regression, 132 
risk, 28 
risk matrix, 38, see also cost matrix 


sample complexity, 56-57 
sampling bias, 171-173, 181 

versus data snooping, 177 
Sauer’s Lemma, 48 
search engines, 1 
selection bias, 173 
SGD, see stochastic gradient descent 
shatter, 42 
sigmoid, 90 
singular value decomposition, 114 
soft order constraint, 157 
soft threshold, 90 
spam, 4, 6 
squared error, 61, 66, 84, 140 
SRM, see structural risk minimization 
statistics, 14 
stochastic gradient descent, 97-99, 110 
stochastic noise, 124 
streaming data, 12 
structural risk minimization, 178 
superstition, 119 
supervised learning 

definition, 11 
support vector machines, 181 
supremum, 187 
SVD, see singular value decomposition 


tanh, 90
target distribution, 31
target function, 3
noisy, 30-32, 83, 87
test set, 59
Tikhonov regularizer, 131
Tikhonov smoothness penalty, 162
training examples, 4
Truman, 171


underfitting, 135
union bound, 24, 41
unlabeled data, 13, 181
unsupervised learning, 13, 181
learning a language, 13


validation, 137-141
cross validation, 145
model selection, 141
summary, 141
validation set, 138
validation error, 138
expectation, 138
optimistic bias, 142
variance, 139
validation set
VC bound, 139, 163
Vapnik-Chervonenkis, see VC
VC dimension, 50
d-dimensional perceptron, 52
and number of parameters, 72
definition, 50
effective, 137
intersection of hypothesis sets, 71
monotonic functions, 71
of composition, 72
union of hypothesis sets, 71
VC generalization bound, 53, 78, 87, 102
definition, 53
proof, 187
sketch of proof, 53
VC Inequality, 187
vending machines, 9
virtual examples, 157


weight decay, 132
cross validation error, 149
example, 126
gradient descent, 156
invariance under linear transform, 162
linear model, 133
negative $\lambda$, 156
optimal $\lambda$, 161
virtual examples, 157


$\mathcal{Z}$ space, 99-102